Methods, software arrangements, storage media, and systems for genotyping or haplotyping polymorphic genetic loci or strain identification

ABSTRACT

The present invention relates generally to systems, methods, storage media, and software arrangements for genotyping and/or haplotyping a sequence of polymorphic genetic loci in a deoxyribonucleic acid (DNA) sample or identifying a strain variant from the DNA sample. Exemplary embodiments of systems, methods, storage media, and software arrangements may perform the optimization of the design of one or more microarrays, each containing a set of oligonucleotide probes capable of detecting one or more known genotypes and/or haplotypes at given polymorphic genetic loci or identifying the strain variant, by optimizing the set of oligonucleotides to be incorporated into the microarrays and by optimizing the arrangement of a set of oligonucleotides on the microarrays. The optimization may be achieved through the application of one or more optimization procedures. The instant invention may be useful in typing individuals at the HLA loci or other polymorphic genetic loci, or may be employed to quickly identify viral or bacterial pathogens from which genome sequence information is available.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a national stage application of PCT Application No. PCT/US2004/024766, which was filed on Aug. 1, 2004 and published in English on Feb. 10, 2005 as International Publication No. WO 2005/013091 (the “International Application”). This application also claims priority from U.S. Patent Application Ser. No. 60/492,210, filed on Aug. 1, 2003 (the “'210 Application”). The entire disclosures of these applications are incorporated herein by reference. This application claims priority from the International Application pursuant to 35 U.S.C. §365, and from the '210 Application pursuant to 35 U.S.C. §§119(e) and 365.

FIELD OF THE INVENTION

The present invention relates generally to systems, methods, and software arrangements for genotyping or haplotyping polymorphic genetic loci or strain identification.

BACKGROUND OF THE INVENTION

The human leukocyte antigen (“HLA”) region on chromosome 6 is highly polymorphic. In particular, the sequence of this region varies from person to person. See, e.g., Consolandi et al., Human Immunology 2003, 64:168. Approximately 1,750 different sequence variants or “alleles” have been identified to date at the HLA locus. There are many biological implications of the high degree of heterogeneity of this region. For example, the presence or absence of a given HLA allele or “HLA type” may predict the presence or absence of diseases, dictate the course of treatment for a patient, or, most notably, determine the compatibility of a potential transplant recipient with the donor organ or bone marrow.

One of the approaches to finding the right allele is to design a microarray experiment that provides the allele as an answer. In fact, HLA typing by sequence hybridization with sequence-specific oligonucleotide probes (“SSOP”) is currently practiced by the National Marrow Donor Program (“NMDP”) for donor-recipient matching, along with more traditional serological-based methods. See, e.g., Cao et al., Reviews in Immunogenetics 1999, 1:177; and Noreen et al., Tissue Antigens 2001, 57:221. In a format that is popular in many current test methods, the DNA samples to be classified are amplified with locus-specific primers and spotted onto the microarray chips, thus resulting in multiple copies of identical chips; each chip is then hybridized to a different probe. See Balazs et al., Human Immunology 2001, 62:850; Consolandi et al., Human Immunology 2003, 64:168. This methodology necessitates a new design process every time a new set of patient samples must be classified. Moreover, most of the currently used techniques are time-consuming and lack optimality.

The system, process, storage medium and software arrangement according to one exemplary embodiment of the present application provide a graph model on the set of potential probes in which the HLA typing problem is formulated mathematically as an optimization problem. According to the present application, it is also possible to utilize an algorithm for solving the optimization problem. The processes of translating the typing problem to the graph model and translating the optimizing probe set back to an experimental design for HLA typing are also described. Extensions of the graph model to more detailed physical models are discussed as well.

SUMMARY OF THE INVENTION

Embodiments of systems, processes, storage media and software arrangements for genotyping or haplotyping the polymorphic genetic loci or strain identification according to the present invention may optimize the design of one or more microarrays, each containing a set of oligonucleotides that are capable of detecting known genotypes or haplotypes at given polymorphic genetic loci. This can be done by optimizing the set of oligonucleotides to be incorporated into the microarrays and/or by optimizing the arrangement of a set of oligonucleotides on the microarrays. Such optimization may also be achieved through the application of one or more optimization algorithms. The present invention may be useful in typing individuals at the HLA loci or other polymorphic genetic loci, and/or may be employed to quickly identify viral or bacterial pathogens from which genome sequence information is available.

In contrast to conventional systems and methods, according to the present invention the sequence-specific probes may be placed on a microarray chip, and each patient sample can be applied to the chip to allow hybridization with one or more of the chip-bound probes to occur. With an appropriate selection of probes, the same chip can be used for all classifications. However, the number of probes to be used, their sequence compositions, and their arrangement on the chip are some of the design variables that should preferably be considered in preparing the microarray. A general solution to these design problems is preferably one that allows the “recognition” of all existing alleles at a target locus, and/or that can decide that the given DNA sequence contains an allele that is not in the “known” list. Such an allele may be a new, previously unknown allele, or one of the very rare alleles that occur so infrequently that they are not considered HLA types. For example, an exemplary embodiment of the present invention is directed toward systems, processes, storage media and software arrangements for genotyping or haplotyping a DNA sample at one or more polymorphic loci contained in the sample through the use of a microarray μA, which can be defined by a set of oligonucleotide hybridization probes configured on the surface of the microarray in a two-dimensional arrangement. The process of querying a given polymorphic locus in a DNA sample (hereafter referred to as a “target” sequence) by a hybridization experiment can be denoted by the expression

$(T_j, \mu A_k) \rightarrow D \rightarrow \hat{T}_j \qquad (1)$

where T_(j) is the true allele contained in the target sequence, μA_(k) is the microarray used in the query, D is the data output of the hybridization experiment, and T̂_(j) is the allele inferred from the data. Both processes in equation (1) are described below.

The problem of genotyping or haplotyping then can be formulated as that of designing, e.g., the “best” microarray, namely, the set and arrangement of oligonucleotide probes that “works” for all known alleles (i.e., ∀j). In the notation employed herein, this means finding μA_(k) which solves the optimization

$\min \sum_{\text{type } j} w_j \, E\left[\Pi_{T_j \neq \hat{T}_j}\right] \;\Leftrightarrow\; \min \sum_{\text{type } j} w_j \, \Pr\left(T_j \neq \hat{T}_j\right). \qquad (2)$

Here, Π_(X) is the indicator function

$\Pi_X = \begin{cases} 1, & \text{if } X \text{ is true} \\ 0, & \text{otherwise} \end{cases}$

and w_(j) is the weight assigned to type j. Initially, w_(j) can be set such that w_(j)=1 ∀j. At a later point it may be desirable to weight different HLA types differently, based on the frequency of their occurrence in the human population or some other criteria.

According to another exemplary embodiment of the present invention, a pseudocode can be provided to select an optimal set of oligonucleotide probes to be incorporated into the one or more microarray devices to be used to genotype or haplotype polymorphic loci or identify the strain present in a DNA sample. According to this exemplary embodiment, vertex boosting weights (initially set to probe information weights) can be used to define a probability distribution on a vertex set present in a graphical representation of “response vectors” derived from each potential probe sequence. On each iteration of a boosting loop, a random subset of a specified size can be selected according to the current probability distribution. All edges in the induced subgraph on this random subset are broken, with one of the terminal vertices removed. The boosting weights of the elements of the subset can then be modified so that the vertices that stayed in the subset are more likely, and the vertices that were thrown out are less likely, to be chosen on the next iteration. The boosting loop may be terminated after a predetermined number of iterations have been performed without further improvement to the list of top independent sets. In addition, the boosting loop can be restarted several times with original probe information weights, which prevents convergence to a local optimum.
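The boosting loop described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the pseudocode of the invention: the function name `boosted_independent_set`, the rule that the endpoint with the lower information weight is removed when an edge is broken, the multiplicative update by `scale`, the clamping of weights, and the scoring of a set by its total information weight are all hypothetical choices that the text does not specify.

```python
import random

def boosted_independent_set(vertices, edges, info_weight, subset_size,
                            scale=2.0, max_stale=50, restarts=3, seed=0):
    """Randomized search for a heavy independent set (illustrative sketch)."""
    rng = random.Random(seed)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    best, best_score = set(), float("-inf")
    for _restart in range(restarts):
        # Each restart begins again from the original probe information weights.
        boost = dict(info_weight)
        stale = 0
        while stale < max_stale:
            # Draw a random subset without replacement, biased by boosting weights.
            pool, subset = list(vertices), []
            for _draw in range(min(subset_size, len(pool))):
                pick = rng.choices(pool, weights=[boost[v] for v in pool], k=1)[0]
                pool.remove(pick)
                subset.append(pick)
            # Break every edge of the induced subgraph by dropping one endpoint
            # (assumption: the endpoint with the lower information weight).
            kept = set(subset)
            for u in subset:
                for v in adj[u]:
                    if u in kept and v in kept:
                        kept.discard(u if info_weight[u] < info_weight[v] else v)
            # Survivors become more likely, removed vertices less likely.
            for v in subset:
                if v in kept:
                    boost[v] = min(boost[v] * scale, 1e12)
                else:
                    boost[v] = max(boost[v] / scale, 1e-12)
            score = sum(info_weight[v] for v in kept)
            if score > best_score:
                best, best_score, stale = kept, score, 0
            else:
                stale += 1
    return best
```

Because a vertex is never re-added once dropped, the returned set contains no two vertices joined by an edge. The parameters `max_stale` and `restarts` play the role of the termination and restart conditions mentioned in the text; their names and default values are placeholders.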

The exemplary embodiments of the systems, processes, storage media and software arrangements of the present invention are generally more beneficial in comparison to conventional methods in that they require fewer probes, thereby minimizing the cost associated with the use of such probes. The exemplary embodiments of the systems, processes, storage media and software arrangements of the present invention are also preferable to conventional methods in that they can minimize competition among neighboring probes, thereby reducing or eliminating the occurrence of systematic biases in the error process.

For a better understanding of the present invention, together with other and further objects, reference is made to the following description, taken in conjunction with the accompanying drawings, and its scope will be pointed out in the appended claims.

DETAILED DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and its advantages, reference is now made to the following description, taken in conjunction with the accompanying drawings, in which:

FIG. 1A is an exemplary embodiment of a system according to the present invention for determining an optimal set of oligonucleotide probes for use in a microarray designed to perform genotyping or haplotyping of polymorphic genetic loci;

FIG. 1B is an exemplary embodiment of a procedure according to the present invention for determining an optimal set of oligonucleotide probes for use in a microarray designed to perform genotyping or haplotyping of polymorphic genetic loci;

FIG. 2 is a first exemplary embodiment of a probe selection procedure (shown in FIG. 1B) of the present invention for determining an optimal set of oligonucleotide probes for use in a microarray designed to perform genotyping or haplotyping of polymorphic genetic loci;

FIG. 3 is a second exemplary embodiment of the probe selection procedure shown in FIG. 1B of the present invention for determining an optimal set of oligonucleotide probes for use in a microarray designed to perform genotyping or haplotyping of polymorphic genetic loci, in which exemplary embodiments of certain steps of FIG. 2 are further illustrated.

DETAILED DESCRIPTION OF THE INVENTION

An exemplary embodiment of the present invention is directed to systems, processes, storage media and software arrangements for genotyping or haplotyping polymorphic genetic loci (e.g., an HLA locus, in a subject) or strain identification. HLA haplotyping or genotyping may assist in, e.g., a prediction of a presence or absence of diseases, selecting a course of treatment for a patient, and/or determining the compatibility of a potential transplant recipient with the donor organ or bone marrow. The systems, processes, storage media and software arrangements of the present invention also may be useful for, e.g., identifying the presence of an unknown pathogen, including but not limited to a virus or a bacterium, in a sample.

By providing a relatively rapid, inexpensive, and potentially highly accurate way of genotyping or haplotyping polymorphic genetic loci or strain identification, the exemplary systems, processes, storage media and software arrangements of the present invention also can be useful in elucidating genotype/phenotype correlations in complex genetic disorders, i.e., those in which more than one gene may play a significant role, or in genetic diseases characterized by the presence of a relatively large number of genotypes or haplotypes at polymorphic loci. The knowledge obtained from the exemplary systems, processes, storage media, and software arrangements of the present invention therefore may assist in facilitating the diagnosis, treatment, and prognosis of individuals bearing a given phenotype.

FIG. 1A shows an exemplary embodiment of a system according to the present invention for determining a particular set of oligonucleotide probes for use in a microarray designed to perform genotyping and/or haplotyping of at least one polymorphic genetic locus. For example, the system includes a processing arrangement 10 (e.g., a computer), which stores a computer program on a storage arrangement 15 (e.g., memory, hard drive, etc.) to execute the exemplary techniques described herein below. In particular, the computer program, when executed by the processing arrangement 10, causes the processing arrangement to obtain known sequences representing classes, e.g., HLA allele sequences, from an external source 20 or any source (as described below). Then the processing arrangement 10 applies a probe selection technique according to the present invention. Thereafter, the processing arrangement 10 outputs the results of the execution of this technique.

FIG. 1B illustrates a first exemplary embodiment of a process for determining an optimal set of oligonucleotide probes for use in a microarray designed to perform genotyping or haplotyping of polymorphic genetic loci or strain identification, which can use the exemplary system shown in FIG. 1A. In this exemplary embodiment, the process executes a first step 100 in which nucleotide sequences of known genotypes or haplotypes at predetermined polymorphic loci, or strain-specific sequences, may be obtained. Such information can be obtained from a number of sources. For example, this data can be obtained from genetic databases including, but not limited to, GenBank and/or other databases maintained by public or private (commercial and noncommercial) institutions. The databases suitable for use with the systems, processes, storage media and software arrangements of the present invention will be readily apparent to those of ordinary skill in the art of DNA sequencing. Alternatively, such information may be obtained from the prior direct sequencing of one or more test samples.

Once a set of known sequences corresponding to genotypes or haplotypes at polymorphic genetic loci or strains is assembled, the probe selection procedure 200 of the present invention may be applied to this set to identify an optimal subset of oligonucleotide probes required to genotype or haplotype the polymorphic genetic loci or identify the strain. This can be executed by the processing arrangement of FIG. 1A. The optimization of the set of oligonucleotide probes may include, but is not limited to, the determination of the overall lengths of the various probes, their sequence compositions, and the minimum number of probes required to identify all selected target genotypes or haplotypes or strains. The selected target genotypes, haplotypes, or strains may include all known genotypes or haplotypes at predetermined polymorphic genetic loci or strains, or a subset thereof. For example, the identification may be limited only to those alleles that are most prevalent in the population from which the DNA sample to be genotyped and/or haplotyped has been obtained.

In another step of the process for determining the optimal set of oligonucleotide probes of FIG. 1B, an optimal subset of oligonucleotide probes preferred for genotyping and/or haplotyping polymorphic genetic loci or strain identification may be generated by the probe selection technique 200 and processing arrangement 10. The optimal probe set 300 may then be arranged on one or more microarray devices to determine the genotype or haplotype of polymorphic genetic loci or identify the strain present in a DNA sample.

Another exemplary embodiment of the present invention further relates to the optimal arrangement of the optimal set of oligonucleotide probes on the one or more microarray devices. The exemplary arrangement of the probes may be based on one or more “conceptual” measures of probe distance. The “conceptual” probe distance can use available biological knowledge about the portions of the genome containing the probes in question, as well as a measure of competition between these probes, as described in Chapter 3 of Cherepinsky, Ph.D. Thesis, New York University 2003. An exemplary recursive technique according to the present invention can facilitate a separation between the “nearest” probe pairs by at least a specified minimum physical distance on the chip. This optimized arrangement can be designed to be disruptive to neighborhoods, defined with respect to the “conceptual” probe distance. Thus, the exemplary arrangement can place conceptually nearby probes far apart on the surface, thereby minimizing the likelihood for competitive interactions between neighboring probes.

FIG. 2 illustrates a first exemplary embodiment of the probe selection algorithm 200 of the present invention for determining the optimal set of oligonucleotide probes. In this exemplary embodiment, the biological problem to be solved, e.g., designing an optimal set of probes to be used on a microarray for determining the genotype or haplotype of polymorphic genetic loci, such as the human HLA region, or strain variation, may first be translated into a graph optimization problem 210. The graph optimization problem can then be solved (item 220) through an application of the exemplary technique according to the present invention. The graph optimization solution 230 can then be translated back into the context of the biological problem to be solved.

FIG. 3 shows a second exemplary embodiment of the probe selection technique (see process 200 of FIG. 1B) according to the present invention for determining the optimal set of oligonucleotide probes, in which the exemplary embodiments of the processes 210 and 220 of FIG. 2 are further illustrated. In this exemplary embodiment, selected target genotypes and/or haplotypes, which may include all known genotypes or haplotypes at one or more predetermined polymorphic genetic loci or a subset thereof, depending on whether pre-processing has been performed, may be used to generate potential probe sequences 211. The potential probe sequences 211 can then be used to generate probe response vectors (PRVs) 212. Using the PRVs 212, a complete edge-weighted and vertex-weighted graph G=(V,E) 213 may be constructed and the algorithm parameters estimated and set. Said parameters include the following: M, the upper bound on the size of the independent set sought; a, the scaling factor used to modify boosting weights of vertices on each iteration of the algorithm; ρ, the edge threshold; parameters minRestartNum and noImprovements, used to set loop termination conditions; as well as unnamed parameters such as the size of the “current-best” list of independent sets. Criteria for estimating M are obtained via probabilistic analysis, discussed in section C.6 of the present application. Other parameters are chosen by trial and error.

Once the graph G is constructed, the exemplary technique of the present invention may be applied to identify one or more optimal subsets of PRVs (step 221), which are output from this exemplary technique. In a post-processing procedure, one or more of the candidate optimal subsets 222 can be selected for a maximum discriminatory power by testing allele coding vectors for redundancy.

The PRVs present in the optimal subset may then be converted back into probe nucleotide sequences by reporting the DNA sequence associated with the probe used to generate each PRV contained in the optimal subset (item 231). This exemplary procedure may be equivalent to translating the solution of the graph optimization problem back into a biological context.

Overall Process Description According to the Present Invention

A. Mathematical Formulation

1. Definitions

Let the different HLA types, or alleles, be denoted by T_(j), j=1, . . . , N. (Here, N=1750, the approximate number of known HLA types.) Let a given microarray be denoted by μA_(k), indexed by k, where a microarray is defined by a set of hybridization probes and their two-dimensional arrangement on the chip surface. The process of querying the given DNA sequence (hereafter referred to as a “target” sequence) by hybridization can be denoted by the expression

$(T_j, \mu A_k) \rightarrow D \rightarrow \hat{T}_j \qquad (1)$

where T_(j) is the true allele contained in the target sequence, μA_(k) is the microarray used in the query, D is the data output of the hybridization experiment, and T̂_(j) is the allele inferred from the data. Both processes in (1) are described below.

The problem of HLA typing can then be formulated as that of designing the best microarray, namely, the set and arrangement of probes, which “works” for all known HLA types (i.e., ∀j). In the present notation, this can mean finding μA_(k) which solves the optimization problem

$\min \sum_{\text{type } j} w_j \, E\left[\Pi_{T_j \neq \hat{T}_j}\right] \;\Leftrightarrow\; \min \sum_{\text{type } j} w_j \, \Pr\left(T_j \neq \hat{T}_j\right). \qquad (2)$

Here, Π_(X) is the indicator function

$\Pi_X = \begin{cases} 1, & \text{if } X \text{ is true} \\ 0, & \text{otherwise} \end{cases}$

and w_(j) is the weight assigned to type j. Initially, w_(j) can be set such that w_(j)=1 ∀j. At a later point, it may be desirable to weight different HLA types differently, based on the frequency of their occurrence in the human population or some other criteria.
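As a rough illustration of how objective (2) might be evaluated for one candidate microarray, the following Python sketch estimates each Pr(T_(j) ≠ T̂_(j)) as an empirical error rate over repeated typing trials and forms the weighted sum. The function name `weighted_typing_error`, the trial format, and the example labels are hypothetical; the text does not prescribe how the probabilities are obtained.

```python
from collections import Counter

def weighted_typing_error(trials, weights=None):
    """Empirical version of objective (2).

    trials  -- iterable of (true_type, inferred_type) pairs for one microarray
    weights -- optional dict mapping type -> w_j; missing types default to 1
    """
    weights = weights or {}
    total, wrong = Counter(), Counter()
    for true_type, inferred_type in trials:
        total[true_type] += 1
        if true_type != inferred_type:   # indicator function of equation (2)
            wrong[true_type] += 1
    # Sum over types j of w_j * (estimated probability of mistyping j).
    return sum(weights.get(t, 1.0) * wrong[t] / total[t] for t in total)
```

Comparing this value across candidate microarrays, under the same trials and weights, would select the design with the smaller weighted error.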

There are several procedures that should be considered in detail: obtaining data D from an experiment based on allele T_(j) and microarray μA_(k) (described below in Section A.2), inferring allele T̂_(j) from the data (described below in Section A.3), generating potential microarrays for the typing experiments (described below in Section A.4), and selecting the optimal microarray (Section B).

2. (T_(j),μA_(k))→D

Consider the set of probes {P_(1,k), . . . , P_(n(k),k)} constituting microarray μA_(k), initially neglecting their arrangement. Ideally, the outcome of the hybridization of the target sequence with each probe P would be binary: 1 if the target contains a subsequence complementary to P, and 0 otherwise. Using n=n(k) probes then yields a binary string of length n, or, alternatively, a vector of length n, as a code for the target sequence. In practice, hybridization results may not necessarily be binary. In particular, the measurements are the analog intensity values corresponding to the amount of formed probe-target complex for each probe. In addition, in an attempt to “factor out” the non-specific signal, each probe can often be present in two versions: one (e.g., a perfect match, or “pm”) perfectly complementary to a region on the target, and the other (e.g., a mismatch, or “mm”) slightly mismatched, the latter usually containing a single base mismatch near the center of the probe. This is the case, for example, in Affymetrix GeneChips. In such a setup, the signal from probe P may be the match-to-mismatch ratio, i.e., the ratio of the intensities corresponding to the matched and mismatched probe-target complexes. Furthermore, the signal can be log-transformed, so that the hybridization outcome for probe P is really the value of

$\log\left(\frac{TP_{pm}}{TP_{mm}}\right)$
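The log-transformed match-to-mismatch signal can be computed as in the minimal Python sketch below. The function name `log_ratio_signal` and the choice of the natural logarithm are assumptions (the text does not fix the base), and the intensities are treated as strictly positive readings.

```python
import math

def log_ratio_signal(pm_intensity, mm_intensity):
    """Return log(TP_pm / TP_mm) for one probe.

    pm_intensity -- intensity of the perfect-match probe-target complex
    mm_intensity -- intensity of the mismatch probe-target complex
    """
    if pm_intensity <= 0 or mm_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log(pm_intensity / mm_intensity)
```

A positive value indicates that the perfect-match probe bound more strongly than its mismatched partner, and a value near zero indicates mostly non-specific signal.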

The situation may further be complicated because the probes may hybridize to positions on the target other than those they were designed to detect; this is known as “cross-hybridization.” In addition, the fact that many probes are present in the system may cause the signal (i.e., the hybridization outcome) from a given probe to differ from the signal of the same probe in the absence of other probes. This is described in more detail in Cherepinsky, Ph.D. Thesis, New York University 2003, Chapter 3.

Thus, the actual result of a hybridization experiment is a vector of n measurements, D ∈ ℝ^(n).

3. D→{circumflex over (T)}_(j)

To infer the allele from the n-vector, the ideal process should be referred to again, where the outcome is D ∈ {0,1}^(n). If the probes are selected in such a way as to provide a distinct binary string for each known allele (so that the Hamming distance d_(H) between any pair of data vectors D_(i), D_(k) is at least 1), then these n probes can be sufficient to identify the allele of the target sequence. What is preferred is to query the sequence with the n probes and read off the allele to which the pattern corresponds. Furthermore, if it is required that d_(H)(D_(i), D_(k)) ≥ α for some α>1, then the discrimination power can be increased and error-correction is possible. This is discussed in greater detail below (e.g., Section B.6).
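The distance requirement can be checked directly on a table of allele codes, as in the Python sketch below. The helper names and the three example codes are hypothetical. By the standard coding-theory argument, a minimum pairwise distance of α allows up to ⌊(α−1)/2⌋ flipped probe outcomes to be corrected by nearest-code decoding.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(codes):
    """Smallest Hamming distance over all pairs of allele codes."""
    return min(hamming(a, b) for a, b in combinations(codes.values(), 2))

# Hypothetical ideal binary codes for three alleles probed with n = 5 probes.
codes = {"allele_1": "00000", "allele_2": "11100", "allele_3": "00111"}
alpha = min_pairwise_distance(codes)
correctable = (alpha - 1) // 2   # probe errors tolerated per measurement
```

A minimum distance of at least 1 means every allele has a distinct code; larger values leave room for erroneous probe readings.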

In a practical setting, D ∈ ℝ^(n). Thus, as a first procedure, some thresholding process must be applied to D to reduce it to a binary string.
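One simple way to carry out this reduction and the subsequent read-off is sketched below in Python. The single global `threshold`, the nearest-code lookup, and the `max_distance` cutoff for reporting an unrecognized pattern are illustrative assumptions; the text only requires that some thresholding process be applied.

```python
def binarize(signals, threshold=0.0):
    """Reduce real-valued probe signals to a binary tuple."""
    return tuple(1 if s > threshold else 0 for s in signals)

def infer_allele(signals, codebook, threshold=0.0, max_distance=0):
    """Return the allele whose code is nearest to the thresholded data.

    codebook     -- dict mapping allele name -> tuple of ideal 0/1 outcomes
    max_distance -- largest Hamming distance still accepted as a match;
                    beyond it the pattern is reported as unknown (None)
    """
    observed = binarize(signals, threshold)

    def distance(code):
        return sum(a != b for a, b in zip(observed, code))

    name, code = min(codebook.items(), key=lambda item: distance(item[1]))
    return name if distance(code) <= max_distance else None
```

Returning `None` corresponds to deciding that the sample carries an allele that is not in the known list.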

4. Generating Potential Microarrays

a. Selection of Informative Probes. A set of n probes, each of length L, that are at least d letters apart (pairwise), must be provided for optimal discrimination among the allele sequences.

If L is not specified, it can be chosen arbitrarily (say, 20), or allowed to vary from probe to probe.

With no restrictions, a very large n can be chosen; for example, every possible 20-mer could be used as a probe. However, this would result in 4²⁰=2⁴⁰=(2¹⁰)⁴>(10³)⁴=10¹², or over a trillion, probes. Such a large set may not be desirable, since many of these probes would give the same results, and it is too expensive to produce all of them. Allowing both n and L to vary may give an even larger number of potential probe sequences.

The probe design problem relates to selecting which of the probes are most useful in discriminating among the given allele sequences, and how many (or rather, how few) one can use appropriately.
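A first cut at the candidate set can be obtained by enumerating only the L-mers that actually occur in the known allele sequences and recording, for each, its ideal binary response across the alleles. The Python sketch below makes two simplifying assumptions that are not taken from the text: a probe is represented by the target subsequence it detects (so complementarity reduces to a substring test), and a probe counts as informative only if it responds to some but not all alleles. The toy sequences and function names are hypothetical.

```python
def candidate_probes(alleles, length):
    """All distinct subsequences of the given length found in any allele."""
    probes = set()
    for seq in alleles.values():
        for start in range(len(seq) - length + 1):
            probes.add(seq[start:start + length])
    return sorted(probes)

def response_vector(probe, alleles):
    """Ideal binary outcome of the probe against each allele, in dict order."""
    return tuple(int(probe in seq) for seq in alleles.values())

def informative_probes(alleles, length):
    """Candidates whose response differs across alleles, with their vectors."""
    selected = {}
    for probe in candidate_probes(alleles, length):
        vector = response_vector(probe, alleles)
        if 0 < sum(vector) < len(alleles):   # not all-zero and not all-one
            selected[probe] = vector
    return selected

# Toy allele sequences (hypothetical, far shorter than real HLA alleles).
alleles = {"T1": "ACGTAC", "T2": "ACGTTC", "T3": "ACCTAC"}
probes = informative_probes(alleles, length=3)
```

For these toy sequences only 9 of the 4³ = 64 possible 3-mers occur at all, which illustrates why restricting attention to subsequences of the known alleles shrinks the search space.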

b. Arrangement of Probes on the Chip. When a set of probes has been selected using techniques described herein above, a question still remains of how to arrange these probes on the microarray chip. Several studies indicate that the patterns observed in the results of chip experiments may be due to the arrangement of probes on the chip. See, e.g., Kluger et al., submitted to Nature Genetics, 2004 at URL http://bioinfo.mbb.yale.edu/~kluger/pipeline/KLUGERetal_NG.pdf; Yu et al., submitted to Nature Biotechnology, 2004 at URL http://bioinfo.mbb.yale.edu/~kluger/pipeline/YQK_NB.pdf; and Qian et al., submitted to Biotechniques, 2004 at URL http://bioinfo.mbb.yale.edu/~kluger/pipeline/QYK_artifact.pdf. In particular, it has been observed that the probes are arranged on a chip based on the labels of the genes they represent, and a gene label is often related to the function and/or disorder in which the gene is involved. As a result, genes of shared function have similar labels and are coexpressed, generating monochromatic bands on microarray chip scans.

These studies indicate that additional consideration should be given to the arrangement of the probes on the chip, based on a certain “conceptual” measure of probe distance. The “conceptual” probe distance can use available biological knowledge about the portions of the genome containing the probes in question, as well as a measure of competition between these probes, as discussed in Chapter 3 of Cherepinsky, Ph.D. Thesis, New York University 2003. The recursive technique described below can ensure that the “nearest” probe pairs would be separated by at least a specified minimum physical distance on the chip. It is designed to be disruptive to neighborhoods, defined with respect to the “conceptual” probe distance. Thus, the exemplary technique places conceptually nearby probes far apart on the surface.

Consider a bijective function ƒ: {0, . . . , N²−1}→{0, . . . , N−1}×{0, . . . , N−1} that maps every pair of “nearby” points in the domain space to a pair of “distant” points in the range space. In particular, consider a function ƒ with the following property: for every x, y, if |x−y| ≤ 4^(α), then ∥ƒ(x)−ƒ(y)∥₁ ≥ N/2^(α+1). This function likely gives an optimal placement. If the elements of the domain space satisfy other distance properties, this technique can be suitably generalized to handle similar properties with respect to the new distance metric.

This function can play an important role in determining how to place a set of oligonucleotide probes on a microarray surface in such a manner that if two probes are close to one another in their genome locations then they are reasonably far apart on the array. Thus, a placement determined by the function can minimize the competition among the probes for the genomic targets, as well as the systematic biases in the error processes.

Inductively, a uniform family of functions ƒ_(k) may be defined as follows. Let k<lg N.

$\begin{matrix} f_{k+1} : \{0, \ldots, N^{2}-1\} \rightarrow \{0, \ldots, N-1\} \times \{0, \ldots, N-1\} \\ : x \mapsto \langle i, j \rangle. \end{matrix}$

ƒ_(k+1) is defined in terms of ƒ_(k),

$f_{k} : \{0, \ldots, N^{2}/4-1\} \rightarrow \{0, \ldots, N/2-1\} \times \{0, \ldots, N/2-1\},$

as follows:

$f_{k+1}(x) = \begin{cases} f_{k}\left(\left\lfloor \frac{x}{4} \right\rfloor\right), & \text{if } x \equiv 0 \bmod 4 \\ f_{k}\left(\left\lfloor \frac{x}{4} \right\rfloor\right) + \left\langle 0, \frac{N}{2} \right\rangle, & \text{if } x \equiv 1 \bmod 4 \\ f_{k}\left(\left\lfloor \frac{x}{4} \right\rfloor\right) + \left\langle \frac{N}{2}, 0 \right\rangle, & \text{if } x \equiv 2 \bmod 4 \\ f_{k}\left(\left\lfloor \frac{x}{4} \right\rfloor\right) + \left\langle \frac{N}{2}, \frac{N}{2} \right\rangle, & \text{if } x \equiv 3 \bmod 4 \end{cases} \qquad (3)$

This function can be generalized, without its general properties being affected, by including a random permutation π_(k+1): {0, . . . , 3}→{0, . . . , 3} as follows:

$f_{k+1}(x) = \begin{cases} f_{k}\left(\left\lfloor \frac{x}{4} \right\rfloor\right), & \text{if } x \equiv \pi_{k+1}(0) \bmod 4 \\ f_{k}\left(\left\lfloor \frac{x}{4} \right\rfloor\right) + \left\langle 0, \frac{N}{2} \right\rangle, & \text{if } x \equiv \pi_{k+1}(1) \bmod 4 \\ f_{k}\left(\left\lfloor \frac{x}{4} \right\rfloor\right) + \left\langle \frac{N}{2}, 0 \right\rangle, & \text{if } x \equiv \pi_{k+1}(2) \bmod 4 \\ f_{k}\left(\left\lfloor \frac{x}{4} \right\rfloor\right) + \left\langle \frac{N}{2}, \frac{N}{2} \right\rangle, & \text{if } x \equiv \pi_{k+1}(3) \bmod 4 \end{cases}$
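The permuted recursion can be sketched in Python as follows. Two points are assumptions rather than statements of the text: the recursion is carried down to a single cell at level 0 (with identity permutations this reproduces the unpermuted placement), and the residue x mod 4 is matched against π(m) so that the offset index is m = π⁻¹(x mod 4). The names `place_permuted`, `OFFSETS`, and `perms` are hypothetical.

```python
import random

# Offsets of the four cases, in units of N/2, in the order
# <0,0>, <0,N/2>, <N/2,0>, <N/2,N/2>.
OFFSETS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def place_permuted(x, level, perms):
    """Grid position <i, j> of probe index x on a 2**level by 2**level array.

    perms[k] is a list containing a permutation of 0..3 used at level k.
    """
    if level == 0:
        return (0, 0)                     # assumed base: a single cell
    half = 2 ** (level - 1)               # N/2 at this level
    m = perms[level].index(x % 4)         # case m applies when x = pi(m) mod 4
    di, dj = OFFSETS[m]
    i, j = place_permuted(x // 4, level - 1, perms)
    return (i + di * half, j + dj * half)

# Example: a random permutation at each of three levels (an 8 x 8 array).
rng = random.Random(1)
perms = {k: rng.sample(range(4), 4) for k in (1, 2, 3)}
positions = [place_permuted(x, 3, perms) for x in range(64)]
```

Since each level only relabels which residue goes to which quadrant, every probe index still receives its own cell, so the map remains a bijection.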

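Herein below, k may take the value (lg N−1), and the base case is given by the function

$f_{2} : \{0, \ldots, 15\} \rightarrow \{0, \ldots, 3\} \times \{0, \ldots, 3\}$

where

$\begin{matrix} 0 \mapsto \langle 0,0 \rangle & 1 \mapsto \langle 0,2 \rangle & 2 \mapsto \langle 2,0 \rangle & 3 \mapsto \langle 2,2 \rangle \\ 4 \mapsto \langle 0,1 \rangle & 5 \mapsto \langle 0,3 \rangle & 6 \mapsto \langle 2,1 \rangle & 7 \mapsto \langle 2,3 \rangle \\ 8 \mapsto \langle 1,0 \rangle & 9 \mapsto \langle 1,2 \rangle & 10 \mapsto \langle 3,0 \rangle & 11 \mapsto \langle 3,2 \rangle \\ 12 \mapsto \langle 1,1 \rangle & 13 \mapsto \langle 1,3 \rangle & 14 \mapsto \langle 3,1 \rangle & 15 \mapsto \langle 3,3 \rangle \end{matrix} \qquad (4)$

This base map can be described in matrix format as follows: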

$\begin{matrix}\begin{bmatrix}0 & 4 & 1 & 5 \\ 8 & 12 & 9 & 13 \\ 2 & 6 & 3 & 7 \\ 10 & 14 & 11 & 15\end{bmatrix} & (5)\end{matrix}$

Taking N=2^(l), N²=4^(l) probes can be placed by applying ƒ_(l): {0, . . . ,4^(l)−1}, which after l−2 recursive steps, defined in equation (3), may reduce to the base case ƒ₂: {0, . . . ,15} shown in equations (4) and (5).

Function ƒ_(l) can have the following distance properties. Let D(i, j) be the distance between probes p_(i) and p_(j) when arrayed on a line (by relabeling the probes, one can view this as the index separation |i−j|). Let d(i, j) be their distance when arrayed on the surface. Then, the mapping ƒ_(l) may guarantee that for all p_(i), p_(j) for which D(i, j)≦4^(k), d(i, j)≧2^(l−k−1), where k=0, . . . ,l−1. Furthermore, if d(i, j)=1, that is, p_(i) is placed next to p_(j) on the surface, then D(i, j)≧3·4^(l−2).

For example, let l=3. There are N²=4^(l)=64 probes, which are placed by ƒ₃ according to

$\begin{bmatrix}0 & 16 & 4 & 20 & \; & 1 & 17 & 5 & 21 \\ 32 & 48 & 36 & 52 & \; & 33 & 49 & 37 & 53 \\ 8 & 24 & 12 & 28 & \; & 9 & 25 & 13 & 29 \\ 40 & 56 & 44 & 60 & \; & 41 & 57 & 45 & 61 \\ \; & \; & \; & \; & \; & \; & \; & \; & \; \\ 2 & 18 & 6 & 22 & \; & 3 & 19 & 7 & 23 \\ 34 & 50 & 38 & 54 & \; & 35 & 51 & 39 & 55 \\ 10 & 26 & 14 & 30 & \; & 11 & 27 & 15 & 31 \\ 42 & 58 & 46 & 62 & \; & 43 & 59 & 47 & 63\end{bmatrix}$

The probe distances

$\begin{matrix}k & D & d \\ 0 & 1 & 4 \\ 1 & 4 & 2 \\ 2 & 16 & 1\end{matrix}$

likely satisfy the condition that if D≦4^(k), then d≧N/2^(k+1)=2^(l−k−1).
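The recursive placement of equations (3) to (5) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `place`, `layout`, and `min_adjacent_gap` are hypothetical, the identity permutation is used (that is, equation (3) rather than its randomized generalization), and "placed next to" is taken to mean the four horizontal and vertical neighbours on the grid.

```python
# Base case f_2 from equation (4): index x -> (row, column) on a 4 x 4 grid.
BASE = [(0, 0), (0, 2), (2, 0), (2, 2), (0, 1), (0, 3), (2, 1), (2, 3),
        (1, 0), (1, 2), (3, 0), (3, 2), (1, 1), (1, 3), (3, 1), (3, 3)]


def place(x, l):
    """Map probe index x in [0, 4**l) to (row, col) on a 2**l x 2**l grid."""
    if l == 2:
        return BASE[x]
    half = 2 ** (l - 1)  # N/2 for the current level
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]  # cases of equation (3)
    r, c = place(x // 4, l - 1)
    dr, dc = offsets[x % 4]
    return (r + dr, c + dc)


def layout(l):
    """Return the grid whose cell (row, col) holds the probe index placed there."""
    n = 2 ** l
    grid = [[None] * n for _ in range(n)]
    for x in range(4 ** l):
        r, c = place(x, l)
        grid[r][c] = x
    return grid


def min_adjacent_gap(l):
    """Smallest index separation D between horizontally or vertically adjacent cells."""
    g = layout(l)
    n = len(g)
    gaps = []
    for r in range(n):
        for c in range(n):
            if c + 1 < n:
                gaps.append(abs(g[r][c] - g[r][c + 1]))
            if r + 1 < n:
                gaps.append(abs(g[r][c] - g[r + 1][c]))
    return min(gaps)
```

For l=2 the sketch reproduces matrix (5), and for l=3 it reproduces the 8×8 arrangement shown above, with neighbouring cells separated by at least 3·4^(l−2) in index.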

The problem of automatically generating probe sets for DNA microarrays is described in Krause et al., Second IEEE International Workshop on High Performance Computational Biology (HiCOMB 2003), online proceedings at URL http://hpc.eece.unm.edu/HiCOMB/proceedings.html. However, the work described by Krause et al. aims for a probe set that is, even in ideal circumstances, asymptotically much larger than the one generated by the exemplary embodiments of the present invention.

Other biological problems, such as identifying an unknown pathogen as a member of a list of known pathogens, be they viral (see Rash and Gusfield, in Proceedings of the Sixth Annual International Conference on Computational Biology (RECOMB '02), ACM Press, pp. 254-261) or bacterial (see Borneman et al., Bioinformatics 2001, 17:S39), likely have the same mathematical formulation as the problem of HLA typing discussed here. Those applications can also benefit from the improvements to existing approaches provided by the exemplary embodiments of the present invention.

B. Definitions

The exemplary problem of selecting the optimal microarray is described below. The problem of selecting the constituent probes can be reduced to a “best independent set” problem. The following sections define the graph model employed, as well as the meaning of the term “best independent set,” and describe the optimizing algorithm.

1. Notation

There are N known alleles, and n potential probes. Each probe can be described by a “response vector” $\vec{v}_{j} \in \{0,1\}^{N}$, j=1, . . . ,n. The response vector data can be represented in tabular form:

$\begin{matrix}{\begin{matrix} & \vec{v}_{1} & \vec{v}_{2} & \cdots & \vec{v}_{n} \\ \mathrm{HLA}_{1} & 1 & 0 & \cdots & 1 \\ \mathrm{HLA}_{2} & 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ \mathrm{HLA}_{N} & 0 & 0 & \cdots & 1\end{matrix}} & (6)\end{matrix}$

Column j is the response vector for probe $\vec{v}_{j}$:

$\vec{v}_{j} = (v_{j}[1], v_{j}[2], \ldots, v_{j}[N])^{T}, \quad (7)$

and row i is the code for allele i, which we will call HLA_(i):

$\mathrm{HLA}_{i} = (v_{1}[i], v_{2}[i], \ldots, v_{n}[i]) \quad (8)$

2. Original Graph

Each potential probe can form a vertex in the graph. Conceptually, an edge in the graph should connect two probes with shared characteristics.

First, essentially a complete edge- and vertex-weighted undirected graph G=(V,E) on n vertices is constructed, where n is the number of potential probes. In a general problem formulation, n can be very large. For each probe length L, there are 4^(L) possible probes. For instance, as shown above in Section A.4, there may be over a trillion possible probes of length 20. Thus, the graph in the probe interaction model can be very large.

Second, weights are assigned to each vertex v and to each edge e, 0≦w(v), w(e)≦1. The weight of a vertex can be initially set to the “information content” of the corresponding probe response vector (PRV) with respect to the HLA typing problem:

w(v)=min {percentage of 0's, percentage of 1's}/100.   (9)

The term “information content” can be used here differently than defined in information theory, where it may mean the minimum amount of information needed to send a string. See, e.g., Cover and Thomas, Elements of Information Theory, John Wiley and Sons, New York, 1991. Ideally, if possible, all vertices should have weight 0.5. A vertex with a weight too close to zero can be uninformative, and the corresponding probe should only be used if it serves to differentiate an allele that is not distinguishable by using other, more informative probes.

The weight of an edge is initially set to the scaled Hamming distance of the probe response vectors represented by its endpoints:

w(e)=Hamming distance/vector length  (10)

with values close to zero corresponding to sequence-similar probes.
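The vertex and edge weights of equations (9) and (10) can be computed directly from the probe response vectors. The following Python sketch is illustrative only; the function names `vertex_weight` and `edge_weight` are hypothetical, and each response vector is assumed to be a list of 0/1 entries, one per known allele.

```python
def vertex_weight(prv):
    """Information weight of equation (9): the smaller of the fractions of 0's and 1's."""
    ones = sum(prv) / len(prv)
    return min(ones, 1 - ones)


def edge_weight(prv_a, prv_b):
    """Scaled Hamming distance of equation (10) between two response vectors."""
    differing = sum(a != b for a, b in zip(prv_a, prv_b))
    return differing / len(prv_a)
```

A probe that responds to exactly half of the alleles receives the ideal weight 0.5, and two probes with identical response vectors are joined by an edge of weight 0.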

3. Thresholded Graph

Third, the graph G is transformed by thresholding the edges. A threshold ρ is selected and used to generate a modified graph G_(mod)=(V, E_(mod)), where

E_(mod)={e ε E: w(e)≦ρ}

is a set of unweighted edges, and the set of weighted vertices V is unchanged. The choice of threshold ρ will be discussed in further detail below in section D. Hereafter, this modified vertex-weighted graph is employed and denoted by G.

An independent set can be defined on the graph. An independent set is defined as a set of vertices such that for any pair of vertices, there is no edge between them. See Dictionary of Algorithms and Data Structures at NIST at URL http://www.nist.gov/dads/. In particular, a set of vertices V′⊂V can be an independent set if (u ε V′ and v ε V′) implies that {u, v} ∉ E.
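The thresholding step and the independence test can be sketched as follows. This is a minimal Python illustration, assuming response vectors are given as lists of 0/1 entries and vertices are identified by their list positions; the names `threshold_edges` and `is_independent` are hypothetical.

```python
from itertools import combinations


def threshold_edges(prvs, rho):
    """Keep an unweighted edge {i, j} whenever the scaled Hamming distance is <= rho."""
    edges = set()
    for i, j in combinations(range(len(prvs)), 2):
        dist = sum(a != b for a, b in zip(prvs[i], prvs[j])) / len(prvs[i])
        if dist <= rho:
            edges.add(frozenset((i, j)))
    return edges


def is_independent(vertices, edges):
    """True if no two of the given vertices are joined by an edge."""
    return all(frozenset(pair) not in edges for pair in combinations(vertices, 2))
```

With three probes whose response vectors are [1,1,0,0], [1,1,0,1] and [0,0,1,1] and ρ=0.25, only the first two probes are connected, so the first and third probes form an independent set.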

4. Exemplary Goal

The best microarray was defined above in Section A. The corresponding concept is now formulated on the graph model. The best independent set can be defined as a maximum weight yet minimum size independent set. Thus, such a set S ⊂ V must be an independent set, have maximum weight w(S)=Σ_(vεS)w(v), and minimum cardinality |S|. The condition of independence is meant to preclude any unintended interaction among the chosen probes. Maximum weight provides S with maximum discrimination power. Minimum size likely ensures that the smallest collection of probes is used to perform the preferred functions.

Since all vertex weights are nonnegative, the requirements of maximum weight and minimum cardinality are clearly contradictory. The definition can be relaxed somewhat by specifying a priori the desired size M of the set, and looking instead for the maximum weight independent set of size ≦M.
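For very small graphs, the relaxed goal can be checked by exhaustive enumeration, which gives a reference answer against which a heuristic can be compared. The sketch below is not part of the described procedure and is practical only for a handful of vertices; the name `best_independent_set` is hypothetical, `weights` is assumed to be a list of vertex weights, and `edges` a set of two-element frozensets.

```python
from itertools import combinations


def best_independent_set(weights, edges, M):
    """Exhaustively find a maximum-weight independent set with at most M vertices."""
    best, best_weight = (), 0.0
    for size in range(1, M + 1):
        for cand in combinations(range(len(weights)), size):
            # Skip any candidate containing two vertices joined by an edge.
            if any(frozenset(pair) in edges for pair in combinations(cand, 2)):
                continue
            total = sum(weights[v] for v in cand)
            if total > best_weight:
                best, best_weight = cand, total
    return best, best_weight
```

The number of candidates grows combinatorially with the number of vertices, which is why a randomized procedure is used for realistic probe sets.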

C. Optimization Procedure

To achieve this goal, a modification of a Maximal Independent Set procedure, described in Luby, Proceedings of the Seventh Annual ACM Symposium on Theory of Computing (STOC '85), pp. 1-10, is utilized.

1. Procedure Pseudocode

Given graph G and set size M:

a) Initialization:
    (i) Initialize a “current-best” list of independent sets, with associated information weights. It stores a list of the best, say, 20, independent sets seen so far, sorted by information weight.

b) Restart Loop: Execute at least minRestartNum times; if the “current-best” list is not full (i.e., does not have 20 independent sets) by then, keep repeating until the list is filled.
    (i) Initialize boosting weights; this was accomplished by setting the boosting weights to the information weights of the vertices:
        ∀v ε V, w_(b)(v)←w(v).
    (ii) Boosting Loop: Repeat until no improvements have been made to the “current-best” list for a fixed number of iterations (say, five iterations).
        aa. Choose a set S of M vertices randomly from V, with
            $P(v \in S) = \frac{w_{b}(v)}{\sum_{u \in V} w_{b}(u)}.$
        bb. For each edge {u,v} in G|_(S) (the induced subgraph on S), eliminate one of the endpoint vertices. This leaves a set, S₁, of K≦M independent vertices.
        cc. Adjust the boosting weights of vertices in S; for example, increase the boosting weights of the vertices in S₁:
            w_(b)(v)←a w_(b)(v) ∀v ε S₁
            and decrease the boosting weights of the vertices in S−S₁:
            $w_{b}(v) \leftarrow \frac{1}{a} w_{b}(v) \quad \forall v \in S - S_{1}$
            where a≧1 is some previously selected constant.
        dd. If S₁ is not already in the “current-best” list and provides an improvement over some current member of the list, reset the noImprovements counter, locate an appropriate location for S₁ in the list, and update the list. Otherwise, make a note that no changes to the “current-best” list were made on this iteration (i.e., increment the noImprovements counter).
    (iii) If the condition for continuing the restart loop holds (namely, minRestartNum restarts have not yet been executed or the “current-best” list is not yet full), reset the noImprovements counter and repeat step (b).
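The pseudocode above can be rendered as a compact Python sketch. It is a hedged illustration rather than the implementation of the described embodiments, and it rests on several assumptions: all function and parameter names (`sample`, `prune`, `boost`, `patience`, `keep`, `max_restarts`) are hypothetical; the M vertices are drawn one at a time without replacement, each draw being proportional to the current boosting weights; edges are broken by dropping the endpoint with the lower boosting weight, with ties broken at random; and a `max_restarts` cap is added so that the loop ends even when fewer than `keep` distinct independent sets exist.

```python
import random
from itertools import combinations


def sample(rng, wb, M):
    """Step aa: draw M distinct vertices, each draw proportional to boosting weight."""
    pool = list(range(len(wb)))
    chosen = []
    for _ in range(min(M, len(pool))):
        pick = rng.choices(pool, weights=[max(wb[v], 1e-12) for v in pool])[0]
        pool.remove(pick)
        chosen.append(pick)
    return chosen


def prune(rng, S, edges, wb):
    """Step bb: break every edge of the induced subgraph by removing one endpoint."""
    alive = set(S)
    for u, v in combinations(S, 2):
        if u in alive and v in alive and frozenset((u, v)) in edges:
            if wb[u] == wb[v]:
                loser = rng.choice((u, v))
            else:
                loser = u if wb[u] < wb[v] else v
            alive.discard(loser)
    return alive


def boost(weights, edges, M, a=1.5, min_restarts=3, max_restarts=50,
          patience=5, keep=20, seed=0):
    """Return up to `keep` independent sets of size <= M, best information weight first."""
    rng = random.Random(seed)
    best = {}  # frozenset of vertices -> information weight
    for restart in range(max_restarts):
        if restart >= min_restarts and len(best) >= keep:
            break
        wb = list(weights)  # step b.i: boosting weights start at information weights
        stale = 0           # the noImprovements counter
        while stale < patience:
            S = sample(rng, wb, M)
            S1 = prune(rng, S, edges, wb)
            for v in S:     # step cc: reward kept vertices, penalize removed ones
                wb[v] = wb[v] * a if v in S1 else wb[v] / a
            key = frozenset(S1)
            score = sum(weights[v] for v in S1)
            worst = min(best.values(), default=-1.0)
            if key not in best and (len(best) < keep or score > worst):
                best[key] = score
                if len(best) > keep:
                    del best[min(best, key=best.get)]
                stale = 0
            else:
                stale += 1
    return sorted(best.items(), key=lambda item: -item[1])
```

Because the procedure is randomized, the returned list depends on the seed; every returned set is nevertheless independent and has at most M vertices.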

2. Procedure Description

The procedure can utilize vertex boosting weights (e.g., initially set to probe information weights) to define a probability distribution on the vertex set. On each iteration of the boosting loop (step b.ii above), a random subset of a specified size can be selected according to the current probability distribution (step b.ii.aa). All edges in the induced subgraph on this random subset may be broken, with one of the terminal vertices removed (step b.ii.bb). The boosting weights of the elements of the subset can then be modified (step b.ii.cc), so that the vertices that stayed in the subset are more likely, and the vertices that were removed are less likely, to be selected on the next iteration. The boosting loop can terminate after a certain number of iterations with no improvement to the list of top independent sets (to allow some flexibility, the algorithm keeps track of several of the top independent sets instead of only storing the best one seen so far).

The procedure can also restart the boosting loop several times with original probe information weights. This feature can be used to prevent convergence to a local optimum, which is possible for high values of the boosting factor.

3. Detailed Explanation of Procedure

The boosting procedure (as a whole) can be viewed as operating on the probability space of all subsets of our graph. Step b.ii.bb provides that the selected subset is independent, so that the probability distribution can only be supported on independent sets (i.e., the distribution is zero on all non-independent sets). In this view, the procedure can converge to a probability distribution where the best independent set has the highest probability. Each iteration of the boosting loop adjusts the probabilities associated with each vertex in the graph. The subset of interest is always drawn randomly according to the current probability distribution.

If the solution S* is known a priori, its selection by the procedure can possibly be guaranteed by initializing the boosting weights in step b.i to be

∀v ε S*, w_(b)(v)←1,
∀v ε V−S*, w_(b)(v)←0.

In other words, the associated probability distribution may have a probability of 1 for each vertex v ε S*, and a probability of 0 for each remaining vertex v ε V−S*.

Given an unlimited time for obtaining a solution (and an appropriate set of parameters), the boosting procedure can converge to this ideal distribution. However, when time is limited, a “good” (i.e., informative) independent set of size ≦M is only “more likely” than other independent sets of similar size. The procedure is provided to give an effective solution in a limited time, and yet be able to improve on it iteratively when more time is permitted (with minimal modifications to the loop terminating conditions).

For example, the best independent set may be, ideally, a fixed point for the algorithm, in the sense that if the procedure starts at a perturbed location in the subset probability space, it should converge to the optimal set. In particular, if the initial probability distribution is heavily favored towards a set that does not differ from the best set in many vertices, the procedure likely converges to the best set.

4. Breaking Edges

In step b.ii.bb, the boosting procedure can keep the vertex that has the higher boosting weight. If the two vertices have equal boosting weights, one of them can be selected at random (with probability ½).

5. Selecting Scaling Factor a

In step b.ii.cc of the boosting algorithm, the weights of those vertices that were selected and kept are boosted (scaled up) by a factor of a≧1, while the weights of discarded vertices are scaled down by the same factor. This has the effect of noting which vertices are selected for membership in the independent set, and increasing the likelihood that these vertices will be selected in the future, with the reverse effect on the discarded vertices. The manner in which the value of the scaling factor a affects the “memory” of the probability space evolution is discussed below. A single “restart” of the procedure (namely, step b.ii) is described.

a) Extreme cases:

-   a=1: No memory of previous selections. Ignore the current selection, and choose anew on the next iteration. The boosting procedure performs an exhaustive search.
-   a=∞: Perfect memory. Once a set S of vertices is selected and pruned, and its elements' boosting weights are modified, each of the vertices remaining in the independent set S₁ will have a boosting weight of ∞ and each of those thrown out of the independent set due to conflicts will have a boosting weight of 0. Thereafter, the boosting procedure will always choose the independent set selected on the first run.

b) Real values:

The boosting procedure was executed on the same graph model with several values of a ε {2, 1.5, 1.2, 1.1}. The executions with higher values of a were observed to terminate a single “restart” after a smaller number of iterations than those with lower values of a. Values of a can be chosen by many methods known to those with ordinary skill in the art.

6. Choosing M: the Maximum Size of the Independent Set

This section contains a probabilistic analysis of an answer to the following question: What are the bounds on the number of probes, k, that is sufficient to distinguish N known alleles? In order to answer this question, certain assumptions can be made on the random distribution from which the known alleles are assumed to be drawn.

a) Similar PRV Entries

Assume that each probe, at each index i=1, . . . ,N, assumes values 0 and 1 independently and with equal probability. Consider k such probes and two alleles (HLA_(l) and HLA_(m)). Thus, if HLA_(l) is fixed:

HLA_(l)=(HLA_(l)[1], . . . ,HLA_(l)[k]),

then for each j,

Pr(HLA_(m)[j]=HLA_(l)[j])=½
Pr(HLA_(m)[j]≠HLA_(l)[j])=½

and, for these two HLA vectors,

$\begin{matrix}{\Pr(\text{the Hamming distance between 2 HLA vectors} = x) = \binom{k}{x} 2^{-k},} & (11)\end{matrix}$

which can easily be seen as follows:

$\begin{matrix}{\Pr(\text{the Hamming distance between 2 HLA vectors} = x)} \\ {= \Pr(\text{HLA vectors differ in exactly } x \text{ positions})} \\ {= \Pr(x \text{ successes in } k \text{ Bernoulli trials, where success} = \{\mathrm{HLA}_{m}[j] \neq \mathrm{HLA}_{l}[j]\} \text{ and } p = \Pr(\text{success}) = \tfrac{1}{2})} \\ {= \binom{k}{x} (\tfrac{1}{2})^{x} (\tfrac{1}{2})^{k-x}} \\ {= \binom{k}{x} 2^{-k}}\end{matrix}$

Thus, for a fixed pair of alleles,

$\begin{matrix}{\Pr(x \geq 1) = 1 - \Pr(x = 0) = 1 - \binom{k}{0} 2^{-k} = 1 - 2^{-k} \quad (\text{by } (11)),} & (12)\end{matrix}$

and

$\begin{matrix}{\Pr(\forall_{pairs}\, x \geq 1) = \prod_{pairs} \Pr(x \geq 1) \quad (\text{by indep.})} \\ {= \prod_{pairs} (1 - 2^{-k}) \quad (\text{by } (12))} \\ {= (1 - 2^{-k})^{\#\,pairs}} \\ {= (1 - 2^{-k})^{\binom{N}{2}} \quad (13)}\end{matrix}$

since there are N distinct allele vectors and pairs are unordered.

This probability ideally should be greater than (1−ε) for some fixed small 0<ε<<1, i.e.,

$\begin{matrix}{(1 - 2^{-k})^{\binom{N}{2}} \overset{want}{>} 1 - \varepsilon .} & (14)\end{matrix}$

First, the left-hand side is bounded:

$\begin{matrix}{(1 - 2^{-k})^{\binom{N}{2}} = [ (1 - 2^{-k})^{2^{k}} ]^{\binom{N}{2} 2^{-k}}} \\ {> ( e^{-1 - 2^{-k}} )^{\binom{N}{2} 2^{-k}}} \\ {= e^{-\binom{N}{2} (1 + 2^{-k}) 2^{-k}}}\end{matrix}$

where the inequality comes from the bound (appendix A)

$\begin{matrix}{(1 - \tfrac{1}{n})^{n} > e^{-1 - \frac{1}{n}} \quad \text{for large } n.} & (15)\end{matrix}$

For the inequality in (14) to operate, the bound (15) should be in the correct direction. Suppose a>b is desired. If instead a>c is demonstrated and the parameter is selected such that c>b, then it can be concluded from a>c>b that a>b. Therefore, the above inequality chain would work if bound (15) holds.

Hereafter, the symbol $\Longleftarrow$ is used to indicate the steps in the inequality reduction that can satisfy the previous statements whenever the parameter in question is selected to satisfy the current statement.

Thus, inequality (14) can be reduced to the following:

$\begin{matrix}{e^{-\binom{N}{2} (1 + 2^{-k}) 2^{-k}} > 1 - \varepsilon \;\Longleftrightarrow\; -\binom{N}{2} (1 + 2^{-k}) 2^{-k} > \ln(1 - \varepsilon)} & (16)\end{matrix}$

Next, consider the right-hand term: ln(1−x)<−x for 0<x<1. Again, a>b is desired. If b<d is demonstrated and the parameter is selected such that a>d, it can be concluded from a>d>b that a>b. This permits the reduction of the inequality (16) to the following:

$\begin{matrix}{-\binom{N}{2} (1 + 2^{-k}) 2^{-k} > -\varepsilon} & (17) \\ {\Longleftrightarrow\; \varepsilon > \binom{N}{2} (1 + 2^{-k}) 2^{-k}} & \\ {\Longleftrightarrow\; \frac{4^{k}}{2^{k} + 1} > (1/\varepsilon) \binom{N}{2},} & (18) \\ {\text{since}} & \\ {(1 + 2^{-k}) 2^{-k} = (2^{k} + 1) 2^{-2k} = (2^{k} + 1) 4^{-k}.} & (19)\end{matrix}$

Furthermore,

$\begin{matrix}{\frac{\beta^{2}}{\beta + 1} = \frac{\beta^{2} + \beta - \beta - 1 + 1}{\beta + 1} = \beta - 1 + \frac{1}{\beta + 1} > \beta - 1 \quad \forall \beta > -1.} & (20)\end{matrix}$

Hence, taking β=2^(k) yields

$\begin{matrix}{\frac{4^{k}}{2^{k} + 1} > 2^{k} - 1.} & (21)\end{matrix}$

Thus, (18) follows if k is chosen to satisfy

$\begin{matrix}{2^{k} - 1 > (1/\varepsilon) \binom{N}{2} \;\Longleftrightarrow\; 2^{k} > (1/\varepsilon) \binom{N}{2} + 1} & (22)\end{matrix}$

The remaining chain of inequalities, from “desired” to “obtained,” can now be verified. For example, the inequality (17) can be extended

$-\binom{N}{2} (1 + 2^{-k}) 2^{-k} > -\varepsilon > \ln(1 - \varepsilon) \;\Longrightarrow\; e^{-\binom{N}{2} (1 + 2^{-k}) 2^{-k}} > 1 - \varepsilon,$

which in turn can be extended

$(1 - 2^{-k})^{\binom{N}{2}} > e^{-\binom{N}{2} (1 + 2^{-k}) 2^{-k}} > 1 - \varepsilon \;\Longrightarrow\; (1 - 2^{-k})^{\binom{N}{2}} > 1 - \varepsilon, \quad \text{as desired.}$

Therefore, k (given ε, N) is selected to satisfy (22):

$2^{k} > (1/\varepsilon) \binom{N}{2} + 1$

A simpler bound on k can be obtained by imposing a stronger condition

$\begin{matrix}{2^{k} \overset{want}{>} (2/\varepsilon) \binom{N}{2},} & (23)\end{matrix}$

which implies (22) since

$(1/\varepsilon) \binom{N}{2} > 1.$

The right-hand side of equation (23) simplifies to

$(2/\varepsilon) \binom{N}{2} = (2/\varepsilon) \frac{N(N - 1)}{2} = (1/\varepsilon) N (N - 1),$

so that

k>lg N+lg (N−1)−lg ε  (24)

is equivalent to equation (23). Furthermore, since 2 lg N>lg N+lg (N−1), selecting

k>2 lg N−lg ε

certainly gives a value of k that satisfies (23). Therefore, requiring

$\begin{matrix}{k > 2 \lg N - \lg \varepsilon} & (25)\end{matrix}$

imposes the strongest condition of those listed above. Hence, a value of k that satisfies equation (25) also satisfies equation (22), and therefore the original desired inequality (14).
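As a numerical illustration of bound (25), the following Python sketch computes the smallest integer k exceeding 2 lg N − lg ε and evaluates the probability (13) for that k. The function names `min_probes` and `prob_all_distinct` are hypothetical, and the calculation holds only under the idealized assumption of independent, equiprobable PRV entries made in this subsection.

```python
import math


def min_probes(N, eps):
    """Smallest integer k with k > 2 lg N - lg eps, as in bound (25)."""
    return math.floor(2 * math.log2(N) - math.log2(eps)) + 1


def prob_all_distinct(N, k):
    """Probability (13) that every pair of the N allele code vectors differs."""
    return (1 - 2.0 ** -k) ** math.comb(N, 2)
```

For N=100 alleles and ε=0.01, the bound gives k=20, and the probability (13) evaluated at k=20 exceeds 1−ε. Because (25) is a sufficient condition, smaller values of k may also satisfy (14).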

b) Dissimilar PRV Entries. A violation of the similarity assumptions may be modeled by an error term δ. Suppose a probe fails to contribute to a Hamming distance with probability (1+δ)/2. As discussed previously, each position of the HLA code vector can be considered as a Bernoulli trial, where success is defined as the event that the j^(th) entry of a code vector contributes to the Hamming distance, i.e., {HLA_(m)[j]≠HLA_(l)[j]}, so that

q=Pr(failure)=(1+δ)/2
p=Pr(success)=(1−δ)/2

Therefore,

$\begin{matrix}{\Pr(\text{the Hamming distance} = x) = \binom{k}{x} (\tfrac{1 - \delta}{2})^{x} (\tfrac{1 + \delta}{2})^{k - x} = \binom{k}{x} (1 - \delta)^{x} (1 + \delta)^{k - x} 2^{-k}} & (26)\end{matrix}$

Continuing as before, the following condition can be obtained.

$\begin{matrix}{\Pr(x \geq 1) = 1 - \Pr(x = 0) = 1 - \binom{k}{0} (1 - \delta)^{0} (1 + \delta)^{k - 0} 2^{-k} = 1 - (1 + \delta)^{k} 2^{-k},} & \\ {\text{and} \quad \Pr(\forall_{pairs}\, x \geq 1) = (1 - (1 + \delta)^{k} 2^{-k})^{\binom{N}{2}}.} & (27)\end{matrix}$

This probability should be bigger than (1−ε). In other words,

$\begin{matrix}{(1 - (1 + \delta)^{k} 2^{-k})^{\binom{N}{2}} \overset{want}{>} 1 - \varepsilon} & (28) \\ {\mathrm{LHS}(28) \overset{by\ (15)}{>} e^{-\binom{N}{2} (1 + (\frac{1 + \delta}{2})^{k}) (\frac{1 + \delta}{2})^{k}} \overset{want}{>} 1 - \varepsilon} & (29) \\ {\Longleftrightarrow\; -\binom{N}{2} [1 + (\tfrac{1 + \delta}{2})^{k}] (\tfrac{1 + \delta}{2})^{k} > \ln(1 - \varepsilon)} & (30) \\ {\Longleftarrow\; -\binom{N}{2} [1 + (\tfrac{1 + \delta}{2})^{k}] (\tfrac{1 + \delta}{2})^{k} \overset{want}{>} -\varepsilon} & \\ {\Longleftrightarrow\; \varepsilon > \binom{N}{2} [1 + (\tfrac{1 + \delta}{2})^{k}] (\tfrac{1 + \delta}{2})^{k}} & \\ {\Longleftrightarrow\; \frac{(\frac{2}{1 + \delta})^{2k}}{(\frac{2}{1 + \delta})^{k} + 1} > (1/\varepsilon) \binom{N}{2},} & \end{matrix}$

where the last transformation is obtained as in equation (19), replacing 2 by 2/(1+δ). The same substitution in (21) (i.e., taking β=(2/(1+δ))^(k) in (20)) yields

$\begin{matrix}{\frac{(\frac{2}{1 + \delta})^{2k}}{(\frac{2}{1 + \delta})^{k} + 1} > (\tfrac{2}{1 + \delta})^{k} - 1 \quad \forall k \in \mathbb{N}} & (31)\end{matrix}$

Thus, equation (30) follows if k is chosen to satisfy

$\begin{matrix}{(\tfrac{2}{1 + \delta})^{k} - 1 > (1/\varepsilon) \binom{N}{2} \;\Longleftrightarrow\; (\tfrac{2}{1 + \delta})^{k} > (1/\varepsilon) \binom{N}{2} + 1} & (32)\end{matrix}$

Again, a simpler bound on k can be obtained by imposing a stronger condition

$\begin{matrix}{(\tfrac{2}{1 + \delta})^{k} \overset{want}{>} (2/\varepsilon) \binom{N}{2} = (1/\varepsilon) N (N - 1)} & (33)\end{matrix}$

so that

$\begin{matrix}{\lg [ (\tfrac{2}{1 + \delta})^{k} ] = k (1 - \lg(1 + \delta)) \overset{want}{>} \lg N + \lg(N - 1) - \lg \varepsilon} & (34) \\ {\Longleftarrow\; k (1 - \lg(1 + \delta)) > 2 \lg N - \lg \varepsilon .} & \\ {k > \frac{1}{1 - \lg(1 + \delta)} [2 \lg N - \lg \varepsilon]} & (35)\end{matrix}$
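The effect of the error term δ on bound (35) can be illustrated numerically. The Python sketch below is an illustration only; the names `min_probes_biased` and `prob_all_distinct_biased` are hypothetical, and δ is assumed to lie in [0, 1) so that the factor 1 − lg(1+δ) is positive.

```python
import math


def min_probes_biased(N, eps, delta):
    """Smallest integer k with k > (2 lg N - lg eps) / (1 - lg(1 + delta)), bound (35)."""
    scale = 1.0 / (1.0 - math.log2(1.0 + delta))
    return math.floor(scale * (2 * math.log2(N) - math.log2(eps))) + 1


def prob_all_distinct_biased(N, k, delta):
    """Probability (27) that every pair of code vectors differs in at least one entry."""
    return (1 - ((1 + delta) / 2) ** k) ** math.comb(N, 2)
```

With δ=0 the sketch reduces to bound (25), giving k=20 for N=100 and ε=0.01; with δ=0.2 the same N and ε require k=28.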

c) Non-unit Minimum Hamming Distance. Further, the preferable size k can be estimated for (almost) any desired minimum Hamming distance between allele code vectors. As demonstrated above, the Hamming distance between a pair of HLA vectors is a binomial random variable x˜S(n, p), where # trials≡n=k, Pr(success)≡p=(1−δ)/2, and Pr(failure)≡q=(1+δ)/2:

$\Pr(x) = \binom{k}{x} (1 - \delta)^{x} (1 + \delta)^{k - x} 2^{-k}$

Its mean is np=k(1−δ)/2 and its variance is npq=k(1−δ)(1+δ)/4. The following estimate can be obtained (using Chernoff bounds):

Pr(x≦k(1−δ)/4)≦e^(−k(1−δ)/16)  (36)

The Chernoff inequality states (see Appendix B for the proof):

$\begin{matrix}{\Pr( S(n, p) \leq (1 - \lambda) np ) \leq e^{-\frac{\lambda^{2}}{2} np}} & (37)\end{matrix}$

Take n=k, p=(1−δ)/2, and let λ=½. Then by equation (37),

$\begin{matrix}{\Pr(x \leq k(1 - \delta)/4) \leq e^{-\frac{(1/2)^{2}}{2} k \frac{1 - \delta}{2}}} \\ {= e^{-\frac{1}{8} k \frac{1 - \delta}{2}}} \\ {= e^{-k(1 - \delta)/16}}\end{matrix}$

Thus, it can be estimated for which k

Pr(∀_(pairs) x≧k(1−δ)/4)≧1−ε  (38)

From equation (36), it is possible to obtain

Pr(x≧k(1−δ)/4)≧1−e^(−k(1−δ)/16)

and hence,

$\begin{matrix}{\Pr(\forall_{pairs}\, x \geq k(1 - \delta)/4) = \Pr(x \geq k(1 - \delta)/4)^{\binom{N}{2}}} \\ {\geq (1 - e^{-k(1 - \delta)/16})^{\binom{N}{2}}} \\ {\overset{want}{>} 1 - \varepsilon} \\ {\Longleftarrow\; (1 - e^{-k(1 - \delta)/16})^{\binom{N}{2}} > \exp\{ -\binom{N}{2} (1 + e^{-k(1 - \delta)/16}) e^{-k(1 - \delta)/16} \} \overset{want}{>} 1 - \varepsilon} \\ {\Longleftrightarrow\; -\binom{N}{2} (1 + e^{-k(1 - \delta)/16}) e^{-k(1 - \delta)/16} > \ln(1 - \varepsilon)} \\ {\Longleftarrow\; -\binom{N}{2} (1 + e^{-k(1 - \delta)/16}) e^{-k(1 - \delta)/16} > -\varepsilon \quad (\text{see } (17))} \\ {\Longleftrightarrow\; \varepsilon > \binom{N}{2} (1 + e^{-k(1 - \delta)/16}) e^{-k(1 - \delta)/16}} \\ {\Longleftrightarrow\; \frac{e^{k(1 - \delta)/8}}{e^{k(1 - \delta)/16} + 1} > (1/\varepsilon) \binom{N}{2}} \\ {\Longleftarrow\; e^{k(1 - \delta)/16} - 1 > (1/\varepsilon) \binom{N}{2} \quad (\text{by } (20))} \\ {\Longleftrightarrow\; e^{k(1 - \delta)/16} > (1/\varepsilon) \binom{N}{2} + 1} \\ {\Longleftarrow\; e^{k(1 - \delta)/16} > (2/\varepsilon) \binom{N}{2} = (1/\varepsilon) N (N - 1) \quad (\text{see } (23))} \\ {\Longleftrightarrow\; k(1 - \delta)/16 > \ln N + \ln(N - 1) - \ln \varepsilon} \\ {\Longleftarrow\; k(1 - \delta)/16 > 2 \ln N - \ln \varepsilon} \\ {\Longleftrightarrow\; k > \frac{16}{1 - \delta} [2 \ln N - \ln \varepsilon] \quad (39)}\end{matrix}$

Therefore, if

$\begin{matrix}{k > \frac{16}{1 - \delta} [2 \ln N - \ln \varepsilon] \quad \text{then} \quad \Pr(\forall_{pairs}\, x \geq k(1 - \delta)/4) \geq 1 - \varepsilon} & (40)\end{matrix}$

For this exemplary probe set (with k=k(ε, N, δ)), an arbitrarily high probability can be obtained, by the selection of ε in (38), that all HLA coding vectors have pairwise Hamming distance of at least k(1−δ)/4. Thus, this exemplary probe set is able to correct k(1−δ)/8 errors by choosing the coding vector closest to that obtained.

The error term δ should then be estimated. This can be accomplished on a given set S of probes by sampling pairs {l, m} of indices on probes from S and examining the resulting 2-vectors on {0, 1}. The probability of a failure to contribute to the Hamming distance (given by (1+δ)/2) can be estimated by the frequency f₌ of observing equal entries in the 2-vector, since each probe with equal entries in the 2-vector fails to contribute to the Hamming distance between alleles l and m. Therefore,

$\hat{\delta} = 2f_{=} - 1 \quad (41)$

Let f_(≠) denote the frequency of observing unequal entries in the same setting. Thus, f₌+f_(≠)=1, and it is possible to obtain

1−δ=1−(2f₌−1)=2(1−f₌)=2f_(≠).  (42)

The bound in equation (40) becomes

$\begin{matrix}\begin{matrix}{k > {\frac{16}{2f_{\neq}}\lbrack {{2\mspace{11mu}\ln\mspace{11mu} N} - {\ln\mspace{11mu}\varepsilon}} \rbrack}} \\{= {\frac{8}{f_{\neq}}\lbrack {{2\mspace{11mu}\ln\mspace{11mu} N} - {\ln\mspace{11mu}\varepsilon}} \rbrack}}\end{matrix} & (43)\end{matrix}$
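The estimate of f_(≠) and the resulting bound (43) can be sketched in Python as follows. This is an illustration under stated assumptions: the function names `estimate_f_neq` and `probes_for_distance` are hypothetical, each allele code vector is assumed to be a list of 0/1 entries over the same probes, and all allele pairs are examined rather than a random sample of pairs.

```python
import math
from itertools import combinations


def estimate_f_neq(codes):
    """Fraction of unequal entries over all allele pairs and all probes."""
    unequal = total = 0
    for a, b in combinations(codes, 2):
        unequal += sum(x != y for x, y in zip(a, b))
        total += len(a)
    return unequal / total


def probes_for_distance(N, eps, f_neq):
    """Smallest integer k with k > (8 / f_neq) (2 ln N - ln eps), bound (43)."""
    return math.floor(8.0 / f_neq * (2 * math.log(N) - math.log(eps))) + 1
```

The estimate of δ in (41) then follows as 1 − 2·f_(≠). For N=100, ε=0.01 and f_(≠)=0.5 (that is, δ estimated as 0), the sketch returns k=222, considerably more than the k=20 sufficient for a minimum Hamming distance of 1.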

Thus, to generate distinct coding vectors for all alleles (e.g., to guarantee a Hamming distance d_(H)(c_(i), c_(j))≧1 with probability >1−ε), it is possible to select M>k, where k satisfies (35) with δ estimated as in (41). It is also possible to choose M to allow for error correction of up to D/2 errors (guaranteeing with probability >(1−ε) a minimum Hamming distance d_(H)(c_(i), c_(j))≧D): set

D=k(1−δ)/4=kf_(≠)/2

in equation (38), so that k=2D/f_(≠) should satisfy equation (40), and, again, select M>k.

D. Pre-Processing

1. Initial Probe Selection

In section B.2, it was described that starting with all possible probes results in a graph that has many vertices. The discussion below provides certain pre-processing steps that allow the elimination of a large portion of this probe set.

a. Probes that do not hit the HLA region on any allele. Many of the possible length-L probes generally do not provide sequence-specific information about the target. As such, they may be safely left out of the probe selection process. This would allow for a reduction of the starting (perfectly matched) probe set to those probes that are complementary to a subsequence of at least one of the alleles. One way to obtain such a set is as follows. Assume that the allele sequences are provided in the 5′ to 3′ orientation.

A window of length L can be considered along allele T₁. Denote the length of the allelic sequence by len(T₁), and index elements of the sequence starting with 1, so that the entire allele sequence can be denoted by

T₁[1] . . . T₁[len(T₁)].

A probe complementary to the allele subsequence seen through the window [1 . . . L] can be constructed and placed in the set. The window then may be shifted by one nucleotide in the direction of the 3′-end. This process can be repeated until the last window [k . . . (L+k−1)] reaches the end of the target sequence:

L+k−1=len(T₁), so that k=len(T₁)−L+1,

thus generating a set of (len(T₁)−L+1) probes, each perfectly complementary to T₁.

The procedure described above can generate all probes of length L that are, e.g., perfectly complementary to a length-L subsequence of the target (i.e., allele) sequence. Depending on the form in which the allele sequences are given, it may also be desirable to include probes corresponding to windows that are partially shifted off the allele, i.e., windows showing a portion of the given allele sequence together with the corresponding 5′-tail of the sequence, if the window is shifted to the left, or the 3′-tail, if the window is shifted to the right. There are 2(L−1) such probes, corresponding to indices

[(len(T₁)−L+2) . . . (len(T₁)+1)], . . . , [len(T₁) . . . (len(T₁)+L−1)]

for the right-shifted windows and “indices”

[0 . . . (L−1)], . . . , [(2−L) . . . 1]

for the left-shifted windows.

This process can be repeated for the other alleles T₂, . . . , T_(N). To avoid placing duplicate probes in the set, a generated probe can be added to the set if its sequence is not already present. Alternately, the duplicates can be weeded out subsequently. It should be noted that this may have the added advantage of eliminating probes hitting sequence repeats.
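The window-sliding and de-duplication steps described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: alleles are plain strings over A, C, G, T given 5′ to 3′, each probe is written 5′ to 3′ as the reverse complement of its window, and the names `reverse_complement` and `candidate_probes` are hypothetical. Windows partially shifted off the allele are not handled.

```python
# Hypothetical helper table: Watson-Crick complement of each base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the Watson-Crick complement of seq, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def candidate_probes(alleles, L):
    """Slide a length-L window along every allele and collect one probe
    per distinct window, skipping sequences that are already present.

    Each allele T contributes at most len(T) - L + 1 probes.
    """
    seen = set()
    probes = []
    for allele in alleles:
        for start in range(len(allele) - L + 1):
            probe = reverse_complement(allele[start:start + L])
            if probe not in seen:  # add only if not already in the set
                seen.add(probe)
                probes.append(probe)
    return probes
```

With two length-5 alleles and L=4, each allele yields two windows; a window shared by both alleles contributes a single probe.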

The above-described exemplary process has the effect of eliminating probes that hit genomic sequences outside the target region, including probes that hit introns (if allelic sequences are provided in genomic DNA form) from the original collection of all possible probes of length L. The resulting set may contain only those probes complementary to subsequences of the HLA region, or sub-words of the pool of all allele sequences.

b. Non-informative probes. In the set created as described in the previous section, some (and perhaps many) of the probes would not be able to give any information useful for distinguishing among the alleles. These are the probes that may be drawn from the windows that are shared among the alleles: they hybridize to a common subsequence of the alleles. Any such probe may be useless for discriminating alleles: to such a probe, all alleles will look alike. Therefore, these probes can be safely eliminated from the potential probe set.

To locate all such probes, it is necessary to find the common subsequences of length≧L of all the alleles, identify the probes complementary to these subsequences, and remove these probes from the set. This can be done as the next step in the “refinement” of the starting probe set, and/or included as a condition in the process for probe addition specified in the previous section.

c. Potential for cross-hybridization. The probes that are likely to hit multiple sites on the target sequence(s), such as those hitting a repeated region, should be eliminated, as is usually done in microarray design, as their use is likely to produce a high level of noise. Each probe is usually expected to have a unique site on the target. See, e.g., Lockhart et al., Nature Biotechnology 1996, 14:1675; Kaderali and Schliep, An algorithm to select target specific probes for DNA chips at URL http://citeseer.nj.nec.com/kaderali01algorithm.html; Li and Stormo, Bioinformatics, 2001, 17:1067.

2. Graph Generation

a. Generating Probe Response Vectors. Once a set of initial probes is selected, as described above, a probe response vector (7) must be generated for each of these probes. To do that, given probe j, string-matching is performed on each of the N alleles for the Watson-Crick complement of probe j, and the results are used to set

$v_{j}[i] = \begin{cases} 1, & \text{if there is a match with allele } T_{i} \\ 0, & \text{otherwise} \end{cases}$
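A minimal sketch of this string-matching step is given below, assuming exact matching only, alleles and probes given as 5′ to 3′ strings over A, C, G, T, and hypothetical function names. The helper `is_informative` illustrates the elimination of non-informative probes from section D.1.b: a probe whose response vector is constant across alleles cannot distinguish among them.

```python
# Hypothetical helper table: Watson-Crick complement of each base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the Watson-Crick complement of seq, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def response_vector(probe, alleles):
    """v_j[i] = 1 if the complement of the probe occurs in allele T_i, else 0."""
    target = reverse_complement(probe)
    return [1 if target in allele else 0 for allele in alleles]


def is_informative(vector):
    """A probe is informative only if it hits some alleles but not all."""
    return 0 < sum(vector) < len(vector)
```

For example, a probe whose complement occurs in every allele yields an all-ones vector and would be dropped by a filter based on `is_informative`.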

b. Choice of Edge Threshold. The edge threshold parameter ρ was used in section B.3 to transform the initial complete edge-weighted graph. Its value determines how many edges remain in the graph, as well as how “independent” each independent set on the graph really is.

If ρ is too small, there will be very few edges in the graph. Most of the random sets selected by the boosting algorithm will prove to be independent. However, upon examination in post-processing, described in section B.3, it may be found that many of these sets do not possess enough discrimination power to discern all N known alleles.

If, on the other hand, ρ is too large (e.g., ρ>0.5), the graph will be very dense (i.e., have a lot of edges). The algorithm will then have a much harder time finding an independent set of large enough size. The output sets will likely contain far fewer than M vertices, and there may not be enough probes in the candidate sets to discern all N alleles.

A reasonable value of ρ is obtained by trial and error on a given set of potential probes.

E. Post-Processing

The procedure (as described in section C.1) can return a list of 20 best independent sets, sorted by the total information weight of the constituent vertices. Each set may be composed of, e.g., at most M vertices (probes). While the independence and maximum weight conditions can be selected to steer each selected set towards maximum discrimination power, this desired outcome may not be guaranteed. Thus, each of these best independent sets should be checked for redundancy of the allele coding vectors. Given a set S of probe response vectors, the N allele coding vectors generated by S (the rows in (6)) may be extracted and their pairwise Hamming distances d_(H)(c_(i), c_(j)), 1≦i<j≦N, may be computed. If min_({i,j}) d_(H)(c_(i), c_(j))=0, the set lacks discrimination power: at least two of the codes are the same, so the set will not be able to discern all known alleles. Such a set should not be used “as is”, and should either be discarded or supplemented by additional probes. This may indicate that the set is not truly independent, so the choice of edge threshold ρ, discussed in Section D.2.b, was inappropriate.

It is possible to make the testing more stringent, in order to allow for up to D/2 errors in the data, as discussed in Section C.6. Those independent sets of the list of best sets that pass the redundancy testing (by satisfying min_({i,j}) d_(H)(c_(i), c_(j))≧D), in fact satisfy a definition stronger than that formulated for the best independent set. D-best independent set denotes a best independent set with the additional condition min_({i,j}) d_(H)(c_(i), c_(j))≧D.

Those sets that pass the redundancy test can be reordered by aveHamDist=ave_({i,j}) d_(H)(c_(i), c_(j)): once the minimum allele code separation is guaranteed, the usefulness of a probe set to the HLA typing problem can be judged by the metric aveHamDist.
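The redundancy test and the aveHamDist reordering can be sketched as follows. This is an illustration only: a candidate set is assumed to be a list of probe response vectors (0/1 lists of length N), the function names are hypothetical, and sorting in decreasing order of aveHamDist is an assumption about the intended direction of the ranking.

```python
from itertools import combinations


def allele_codes(response_vectors):
    """Transpose the probe response vectors: row i is the coding vector of allele i."""
    return [list(code) for code in zip(*response_vectors)]


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def code_separation(response_vectors):
    """Return (minimum, average) pairwise Hamming distance over allele pairs i < j."""
    codes = allele_codes(response_vectors)
    dists = [hamming(a, b) for a, b in combinations(codes, 2)]
    return min(dists), sum(dists) / len(dists)


def rank_probe_sets(candidate_sets, D=1):
    """Keep sets whose minimum code distance is at least D, ordered by aveHamDist."""
    scored = []
    for probe_set in candidate_sets:
        min_dist, ave_dist = code_separation(probe_set)
        if min_dist >= D:  # redundancy test; D=1 only requires distinct codes
            scored.append((ave_dist, probe_set))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [probe_set for _, probe_set in scored]
```

A set in which two alleles receive the same code has minimum distance 0 and is rejected for any D≧1; raising D corresponds to the more stringent D-best test.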

F. Interpreting Results

Given the best independent set generated by the above-described procedure, a determination must be made regarding how it can be converted into a microarray for genotyping or haplotyping experiments.

In order to generate the microarray corresponding to an independent set of vectors yielded by the procedure (followed by post-processing steps from Section E), the DNA sequence for each probe, which was used to generate the probe response vector used in the analysis, should be recalled. The spatial arrangement of these probes on the chip surface can be decided as discussed above in Section A.4.b.

G. Additional Applications

Many extensions of the approaches presented herein are possible within the scope of the present invention. Two exemplary approaches are discussed below.

1. Extending Weight Functions in the Graph Model

The graph model discussed herein relies on the characteristics of the probe response vectors to define the weights of vertices and edges. While this model can generate certain interesting results, it can be extended to a more meaningful model by incorporating the physical properties of the probe sequences and their interactions, some of which are described in Cherepinsky, Ph.D. Thesis, New York University 2003, Chapter 3.

In particular, the annotation of all potential probes with physical properties, such as melting temperature, free energy, entropy, and enthalpy of hybridization, for perfect matches and for closest matches in other alleles can be used to define cost functions that determine the weights. While the vertex weight may provide a measure of the performance of the corresponding probe in discriminating among known alleles, the pairwise probe interaction and the resulting competition effects, as described in Cherepinsky, Ph.D. Thesis, New York University 2003, Chapter 3, can be reflected in the edge weights.

2. Pooling Real Data from Previously Tested Chips

Another extension of the procedure discussed herein involves the use of data from microarray chips used for HLA typing by different companies. Many biotechnology companies are working on the HLA typing problem in the hope of designing probe sets that give the answer more quickly and with greater accuracy. The sequences of the probes may generally be considered to be proprietary information and thus not shared. As a result, the collection of experimental data from testing the various probes in different combinations and arrangements on the microarray chips generated by different companies may almost never be examined as a whole.

It is possible to employ the probe interaction model presented herein to make use of the aggregate experimental data. Suppose the following information can be obtained: a set of microarray chips along with some identifiers, if not the actual sequences, of the probes comprising each chip, and values measuring the performance of each chip in all previously conducted HLA typing experiments. That is, for each chip, there is a list of unique probe identifiers and some measure of how well this chip performed in HLA typing. It may not be necessary to know the sequence of each probe, so long as the uniqueness of the identifiers can be verified by the company providing the data. It is possible to combine the information from a large number of such previously tested chips to generate a plan for a new microarray chip (i.e., a collection of probe identifiers and their spatial arrangement) with a performance value higher than that of “input” chips by the following process. The probe content and arrangement for each chip, together with its performance value, can be used to build the graph model. Vertex weights can be inferred from chip membership information. Edge weights can be estimated from conditional probabilities using pairwise membership information; that is, by considering two chips at a time, quantities such as the conditional probability that probe P_(i) was used on chip C_(j), given that it was used on chip C_(k), can be estimated. Once the graph is constructed, the boosting algorithm can be used to generate the best set of probes, as discussed in Section C.1.

All publications cited above are incorporated herein by reference in their entireties.

APPENDIX

Appendix A: Exponential Limit Inequality: Proof

Claim: For large n,

$\left( 1 - \frac{1}{n} \right)^{n} > e^{-1 - \frac{1}{n}}. \qquad (A1)$

Proof:

Inequality (A1) is equivalent to

$\begin{matrix}{{\ln\lbrack ( {1 - \frac{1}{n}} )^{n} \rbrack}\overset{want}{>}{{- 1} - {\frac{1}{n}.}}} & ({A2})\end{matrix}$

Since the series expansion of the logarithm is given by

$\ln(1 - x) = -\sum\limits_{j = 1}^{\infty}\frac{x^{j}}{j} = -x - \frac{x^{2}}{2} - \frac{x^{3}}{3} - \ldots \quad \text{for } |x| < 1, \qquad (A3)$

we can expand the left-hand side of (A2) as follows.

$\begin{matrix}\begin{matrix}{{\ln\lbrack ( {1\; - \;\frac{1}{n}} )^{n} \rbrack} = {n\;{\ln( {1\; - \;\frac{1}{n}} )}}} \\{= {n\;\{ {- {\sum\limits_{j\; = \; 1}^{\infty}\frac{( \frac{1}{n} )^{j}}{j}}}\mspace{11mu} \}\mspace{14mu}( {{by}\mspace{11mu}( {A\; 3} )} )}} \\{= {{- n}\;\{ {\sum\limits_{j\; = \; 1}^{\infty}\;\frac{1}{{jn}^{j}}}\mspace{11mu} \}}} \\{= {- {\sum\limits_{j\; = \; 1}^{\infty}\;\frac{1}{{jn}^{j\; - \; 1}}}}} \\{= {{- 1}\; - \;\frac{1}{2\; n}\; - \;\frac{1}{3\; n^{2}}\; - \;\frac{1}{4\; n^{3}}\; - \;\ldots}}\end{matrix} & ( {A\; 4} )\end{matrix}$

Thus, inequality (A2) reduces to

$-1 - \frac{1}{2n} - \frac{1}{3n^{2}} - \frac{1}{4n^{3}} - \ldots \overset{want}{>} -1 - \frac{1}{n} \qquad (A5)$

or, equivalently,

$\frac{1}{3n^{2}} + \frac{1}{4n^{3}} + \frac{1}{5n^{4}} + \ldots \overset{want}{<} -\frac{1}{2n} + \frac{1}{n} = \frac{1}{2n} \qquad (A6)$

Now,

$\begin{aligned}
\frac{1}{3n^{2}} + \frac{1}{4n^{3}} + \frac{1}{5n^{4}} + \ldots &< \frac{1}{3n^{2}} + \frac{1}{3n^{3}} + \frac{1}{3n^{4}} + \ldots \\
&= \frac{1}{3n^{2}}\left( 1 + \frac{1}{n} + \frac{1}{n^{2}} + \ldots \right) \\
&= \frac{1}{3n^{2}} \cdot \frac{1}{1 - \frac{1}{n}} && \text{(geometric sum)} \\
&= \frac{1}{3n} \cdot \frac{1}{n - 1} \overset{want}{<} \frac{1}{2n} && \text{(by (A6))}
\end{aligned}$

Simplifying yields

$\Longleftrightarrow \frac{1}{3(n - 1)} \overset{want}{<} \frac{1}{2} \Longleftrightarrow 2 < 3(n - 1) = 3n - 3 \Longleftrightarrow 5 < 3n,$

which holds for every n≧2. Retracing the chain of inequalities, we obtain

$\left( 1 - \frac{1}{n} \right)^{n} > e^{-1 - \frac{1}{n}} \quad \forall n \geq 2 \qquad (A7)$

as desired.

Appendix B: Chernoff's Inequality: Proof

Claim:

$\Pr\left( S(n,p) \leq (1 - \varepsilon)np \right) \leq e^{-\frac{\varepsilon^{2}}{2}np}, \quad \varepsilon \in (0,1). \qquad (B1)$

Proof:

S is a Binomial random variable:

S(n,p)=X₁+ . . . +X_(n),  (B2)

where the X_(i) are i.i.d. random variables with

$\begin{matrix}{X_{i} = \{ {\begin{matrix}1 & {w.p.\; p} \\0 & {w.p.\mspace{11mu}( {1 - p} )}\end{matrix},\mspace{14mu}{i = 1},\ldots\mspace{11mu},n} } & ( {B\; 3} )\end{matrix}$Therefore,

$\begin{matrix}{{E(S)} = {{\sum\limits_{i = 1}^{n}{E( X_{i} )}} = {{\sum\limits_{i = 1}^{n}p} = {{np}.}}}} & ({B4})\end{matrix}$Since

$\begin{aligned}
S \leq (1 - \varepsilon)np &\Longleftrightarrow S - np \leq -\varepsilon np \\
&\Longleftrightarrow \lambda(S - np) \leq -\lambda\varepsilon np && \forall\lambda > 0 \\
&\Longleftrightarrow -\lambda(S - np) \geq \lambda\varepsilon np && \forall\lambda > 0 \\
&\Longleftrightarrow e^{-\lambda(S - np)} \geq e^{\lambda\varepsilon np} && \forall\lambda > 0, \qquad (B5)
\end{aligned}$

it follows that

$\begin{aligned}
\Pr\left( S \leq (1 - \varepsilon)np \right) &= \Pr\left( e^{-\lambda(S - np)} \geq e^{\lambda\varepsilon np} \right) && (B6) \\
&\leq \frac{E\left[ e^{-\lambda(S - np)} \right]}{e^{\lambda\varepsilon np}} && \text{(by Markov's inequality)} \quad (B7)
\end{aligned}$

For a proof of Markov's inequality, see, e.g., [24].

From (B2) and (B3), we know that

$\begin{matrix}{{S - {np}} = {{{\sum\limits_{i = 1}^{n}X_{i}} - {np}} = {\sum\limits_{i = 1}^{n}{( {X_{i} - p} ).}}}} & ({B8})\end{matrix}$Therefore,

$\begin{aligned}
E\left[ e^{-\lambda(S - np)} \right] &= E\left[ e^{-\lambda\sum_{i}(X_{i} - p)} \right] && (B9) \\
&= E\left[ \prod_{i = 1}^{n} e^{-\lambda(X_{i} - p)} \right] \\
&= \prod_{i = 1}^{n} E\left[ e^{-\lambda(X_{i} - p)} \right] && \text{(by independence)} \\
&= \left\{ E\left[ e^{-\lambda(X_{i} - p)} \right] \right\}^{n} && \text{(by (B3))}
\end{aligned}$

$\begin{aligned}
E\left[ e^{-\lambda(X_{1} - p)} \right] &= p\,e^{-\lambda(1 - p)} + (1 - p)\,e^{-\lambda(-p)} && (B10) \\
&= e^{\lambda p}\left( p\,e^{-\lambda} + (1 - p) \right) \\
&= e^{\lambda p}\left( 1 + \underbrace{p\left( e^{-\lambda} - 1 \right)}_{u} \right) \\
&\leq e^{\lambda p}\,e^{p(e^{-\lambda} - 1)} && (\text{since } 1 + u \leq e^{u}\ \forall u) \\
&= e^{p(e^{-\lambda} - 1 + \lambda)} \\
&\leq e^{p\frac{\lambda^{2}}{2}}, && (B11)
\end{aligned}$

where the last inequality follows from

$e^{-\lambda} \leq 1 - \lambda + \frac{\lambda^{2}}{2} \quad \forall\lambda > 0 \;\Longrightarrow\; e^{-\lambda} - 1 + \lambda \leq \frac{\lambda^{2}}{2} \qquad (B12)$

Therefore, by (B9) and (B11),

$E\left[ e^{-\lambda(S - np)} \right] \leq \left( e^{p\frac{\lambda^{2}}{2}} \right)^{n} = e^{\frac{\lambda^{2}}{2}np} \qquad (B13)$

and

$\begin{aligned}
\Pr\left( S \leq (1 - \varepsilon)np \right) &\leq e^{-\lambda\varepsilon np}\,E\left[ e^{-\lambda(S - np)} \right] && (B14) \\
&\leq e^{-\lambda\varepsilon np}\,e^{\frac{\lambda^{2}}{2}np} \\
&= e^{np\left( \frac{\lambda^{2}}{2} - \lambda\varepsilon \right)} && \forall\lambda > 0.
\end{aligned}$

Since (B14) holds for every λ>0, the bound is tightest at the value of λ that minimizes f(λ)=e^{np(λ²/2−λε)}. Setting the derivative to zero,

$f^{\prime}(\lambda) = f(\lambda) \cdot np(\lambda - \varepsilon) = 0 \Longleftrightarrow \lambda = \lambda^{*} = \varepsilon, \qquad (B15)$

so that

$f(\lambda^{*}) = e^{np\left( \frac{\varepsilon^{2}}{2} - \varepsilon^{2} \right)} = e^{-\frac{\varepsilon^{2}}{2}np}$

and, from (B14),

$\begin{matrix}\begin{matrix}{{\Pr\;( {S \leq {( {1 - \varepsilon} ){np}}} )} \leq e^{{- \frac{\varepsilon^{2}}{2}}{np}}} & {\forall{\varepsilon \in ( {0,1} )}}\end{matrix} & ({B16})\end{matrix}$as desired.

It remains to check that the optimizing λ=λ* is a minimum of f(λ), that is, f″(λ*)>0. By (B15),

$\begin{aligned}
f^{\prime\prime}(\lambda^{*}) &= \left. \left[ f^{\prime}(\lambda) \cdot np(\lambda - \varepsilon) + f(\lambda) \cdot np \right] \right|_{\lambda = \lambda^{*}} \\
&= \underbrace{f^{\prime}(\lambda^{*})}_{0} \cdot np(0) + f(\lambda^{*}) \cdot np \\
&= np\,e^{-\frac{\varepsilon^{2}}{2}np} > 0. \\
\therefore\ \lambda^{*} &= \varepsilon \text{ is a minimum.}
\end{aligned}$
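As an informal sanity check, and not as part of either proof, inequalities (A7) and (B16) can be evaluated numerically. The sketch below assumes Python 3.8 or later (for `math.comb`); the function names are hypothetical, and the exact binomial lower tail is computed by direct summation, which is only practical for modest n.

```python
import math


def exp_limit_holds(n):
    """Check (A7): (1 - 1/n)^n > exp(-1 - 1/n) for a given integer n >= 2."""
    return (1.0 - 1.0 / n) ** n > math.exp(-1.0 - 1.0 / n)


def binom_lower_tail(n, p, t):
    """Exact Pr(S <= t) for S ~ Binomial(n, p), by summing the mass function."""
    upper = math.floor(t)
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(0, upper + 1)
    )


def chernoff_bound(n, p, eps):
    """Right-hand side of (B16): exp(-(eps^2 / 2) * n * p)."""
    return math.exp(-(eps**2) * n * p / 2.0)
```

For any n, p and ε∈(0,1), `binom_lower_tail(n, p, (1 - eps) * n * p)` should not exceed `chernoff_bound(n, p, eps)`; a violation would indicate a coding error rather than a counterexample.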

1. A method for at least one of genotyping or haplotyping a sequence ofpolymorphic genetic loci in a deoxyribonucleic acid (DNA) sample oridentifying a strain variant from the DNA sample, comprising: i)providing one or more microarrays that include a set of oligonucleotideprobes that are capable of detecting the at least one of the genotypes,the haplotypes or the strain variant; ii) hybridizing the DNA sample tothe one or more microarrays to create a hybridization pattern; iii)determining at least one of a genotype, a haplotype or a strain variantbased on the hybridization pattern; and iv) optimizing at least one ofthe set or an arrangement of the oligonucleotide probes as a function ofat least one of a match criteria or a mismatch criteria between a trueallele contained in the DNA sample and an allele determined by thehybridizing step.
 2. The method of claim 1, wherein the one or moremicroarrays include a set of oligonucleotide probes that are capable ofdetecting at least one of all known genotypes or all known haplotypes atthe polymorphic genetic loci or the strain identification.
 3. The methodof claim 1, wherein the one or more microarrays are configured toinclude at least one of an optimal set or an optimal arrangement ofoligonucleotide probes.
 4. The method of claim 1, wherein the mismatchcriteria is the following: $\begin{matrix}{\min{\sum\limits_{{type}\mspace{11mu} j}\;{w_{j}{E\lbrack \prod\limits_{T_{j} \neq {\hat{T}}_{j}}\; \rbrack}}}} \\ \Leftrightarrow{\min{\sum\limits_{{type}\mspace{14mu} j}\;{w_{j}{\Pr( {T_{j} \neq {\hat{T}}_{j}} )}}}} \end{matrix}.$ wherein T_(j) is a true allele contained in the DNAsample, {circumflex over (T)}_(j) is the allele determined by thehybridization step, ${\prod\limits_{x}\;{= \begin{Bmatrix}{1,} & {{if}\mspace{14mu} X\mspace{14mu}{is}\mspace{14mu}{true}} \\{0,} & {otherwise}\end{Bmatrix}}},$  and w_(j) is a weight assigned to at least one of thegenotype or the haplotype j.
 5. The method of claim 4, wherein theweights are provided as follows: w_(j)=1∀_(j), wherein ∀_(j) is a set ofat least one of all known genotypes or all known haplotypes at one ormore predetermined polymorphic genetic loci.
 6. The method of claim 4,wherein the weights are provided as follows: w_(j) is different for eachgenotype or haplotype.
 7. The method of claim 1, further comprising: atleast one of displaying or storing data associated with the at least oneof the genotype, the haplotype or the strain variant in a storagearrangement in at least one of a user-accessible format or auser-readable format.
 8. A method for at least one of genotyping orhaplotyping a sequence of polymorphic genetic loci in a deoxyribonucleicacid (DNA) sample or identifying a strain variant from the DNA sample,comprising: i) providing one or more microarrays that include a set ofoligonucleotide probes that are capable of detecting the at least one ofthe genotypes, the haplotypes or the strain variant; ii) hybridizing theDNA sample to the one or more microarrays to create a hybridizationpattern; and iii) determining at least one of a genotype, a haplotype ora strain variant based on the hybridization pattern, wherein step (iii)produces a vector of n measurements, and wherein n is a number of probescontained on the one or more microarrays.
 9. The method of claim 8,wherein the n potential probes provided to identify N known genotypes orhaplotypes are each associated with a response vector {right arrow over(υ)}_(j) ε{0,1}^(N), j=1, . . . , n.
 10. The method of claim 9, furthercomprising generating a graph G on vertices corresponding to proberesponse vectors.
 11. The method of claim 10, wherein the graph G is a complete edge-weighted and vertex-weighted undirected graph G=(V, E) provided on n vertices, wherein n is the number of potential probes.
 12. The method of claim 11, wherein the weights w of each vertex v and each edge e are constrained by: 0≦w(v), w(e)≦1.
 13. The method of claim 12,wherein the weight w of a vertex v is set to: w(v)=min{fraction of 0's,fraction of 1's}.
 14. The method of claim 12, wherein the weight w of anedge e={u,v} is set to: w(e)=Hamming distance/vector length, whereinHamming distance is measured between the probe response vectorscorresponding to vertices u and v, and vector length is the length ofthe probe response vectors, namely, N.
 15. The method of claim 11, further comprising modifying the graph G by thresholding the edges such that the modified graph G_(mod) is defined as G_(mod)=(V, E_(mod)), wherein E_(mod)={e ∈ E: w(e)≦ρ}, and ρ is a selected threshold value.
 16. The method of claim 15, wherein, for the modified graph G_(mod) and the probe set size M, the following is performed: i) initializing a current-best list of independent sets with associated information weights, ii) initializing vertex boosting weights to vertex weights w(v), iii) defining a probability distribution on the vertex subset based on vertex boosting weights, iv) choosing a random subset of vertices of a specified size M based on the probability distribution, v) eliminating one of the end-point vertices in each of the edges remaining in the induced subgraph on the random subset, vi) modifying the vertex boosting weights by increasing the weights of the vertices that are retained in the subset and decreasing the weights of the vertices that were selected in step (iv) but eliminated in step (v), and vii) repeating steps (iii) through (vi) for at least one of a predetermined number of iterations or until no improvement to the list of top independent sets is achieved.
 17. The method of claim 16, wherein, forthe modified graph G_(mod) and the probe set size M, steps (ii) through(vii) are repeated for a predetermined number of iterations, eachiteration starting with reinitializing vertex boosting weights to vertexweights w(v) in step (ii).
 18. The method of claim 17, wherein, for agiven fixed small 0<ε<<1, the probe set size M satisfies an inequalityPr(∀code pairs, Hamming distance≧1)>1−ε.
 19. The method of claim 17, wherein, for a given fixed small 0<ε<<1 and a fixed α>1, the probe set size M satisfies an inequality Pr(∀code pairs, Hamming distance≧α)>1−ε.
 20. The method of claim 16, wherein the threshold ρ has a value to enable the graph G to have a sparsity bounded by A≦sparsity≦B, wherein the sparsity is definable by the average degree of a vertex in the graph G.
 21. The method of claim 20, wherein the lower bound A is a relativelysmall constant, and the upper bound B is a function of the number ofvertices n.
 22. The method of claim 8, further comprising: at least oneof displaying or storing data associated with at least one of thegenotype, the haplotype, the strain variant or the vector in a storagearrangement in at least one of a user-accessible format or auser-readable format.
 23. A non-transitory storage medium which includesthereon a software arrangement for providing one or more microarrays,which configures a processing arrangement to perform the procedurescomprising: i) receiving information regarding a hybridization of theDNA sample to one or more microarrays to create a hybridization pattern,the one or more microarrays including a set of oligonucleotide probesthat are capable of detecting at least one set of genotypes orhaplotypes for a sequence of polymorphic genetic loci in adeoxyribonucleic acid (DNA) sample or identifying a strain variant fromthe DNA sample; ii) determining at least one of a genotype, a haplotypeor a strain variant based on the hybridization pattern; and iii)optimizing at least one of the set or an arrangement of theoligonucleotide probes as a function of at least one of a match criteriaor a mismatch criteria between a true allele contained in the DNA sampleand an allele determined from the hybridization.
 24. The storage mediumof claim 23, wherein the mismatch criteria is the following:$\begin{matrix}{\min{\sum\limits_{{type}\mspace{11mu} j}\;{w_{j}E\lfloor \prod\limits_{T_{j} \neq {\hat{T}}_{j}}\; \rfloor}}} \\ \Leftrightarrow{\min{\sum\limits_{{type}\mspace{14mu} j}\;{w_{j}{\Pr( {T_{j} \neq {\hat{T}}_{j}} )}}}} \end{matrix}.$ wherein T_(j) is a true allele contained in the DNAsample, {circumflex over (T)}_(j) is the allele determined by thehybridization step, ${\prod\limits_{x}\;{= \begin{Bmatrix}{1,} & {{if}\mspace{14mu} X\mspace{14mu}{is}\mspace{14mu}{true}} \\{0,} & {otherwise}\end{Bmatrix}}},$  and w_(j)is a weight assigned to at least one of thegenotype or the haplotype j.
 25. The storage medium of claim 24, whereinthe weights are provided as follows: w_(j)=1∀_(j), wherein ∀_(j) is aset of at least one of all known genotypes or all known haplotypes atone or more predetermined polymorphic genetic loci.
 26. The storagemedium of claim 24, wherein the weights are provided as follows: w_(j)is different for each genotype or haplotype.
 27. A system for at leastone of genotyping or haplotyping polymorphic genetic loci or strainidentification in a deoxyribonucleic acid (DNA) sample, comprising: aprocessing device which, when executed, is configured to: i) receiveinformation regarding a hybridization of the DNA sample to one or moremicroarrays to create a hybridization pattern, the one or moremicroarrays including a set of oligonucleotide probes that are capableof detecting at least one set of genotypes or haplotypes for a sequenceof polymorphic genetic loci in a deoxyribonucleic acid (DNA) sample oridentifying a strain variant from the DNA sample; ii) determine at leastone of a genotype, a haplotype or a strain variant based on thehybridization pattern; and iii) optimize at least one of the set or anarrangement of the oligonucleotide probes as a function of at least oneof a match criteria or a mismatch criteria between a true allelecontained in the DNA sample and an allele determined by thehybridization.
 28. The system of claim 27, wherein the mismatch criteriais the following: $\begin{matrix}{\min{\sum\limits_{{type}\mspace{14mu} j}{w_{j}\mspace{14mu}{E\lbrack \Pi_{T_{j} \neq {\hat{T}}_{j}}\; \rbrack}}}} \\ \Leftrightarrow{\min{\sum\limits_{{type}\mspace{14mu} j}{w_{j\mspace{11mu}}{{\Pr( {T_{j} \neq {\hat{T}}_{j}} )}.}}}} \end{matrix}$ wherein T_(j), is a true allele contained in the DNAsample, {circumflex over (T)}_(j) is the allele determined by thehybridization step, ${\prod\limits_{x}\;{= \begin{Bmatrix}{1,} & {{if}\mspace{14mu} X\mspace{14mu}{is}\mspace{14mu}{true}} \\{0,} & {otherwise}\end{Bmatrix}}},$  and w_(j)is a weight assigned to at least one of thegenotype or the haplotype j.
 29. The system of claim 28, wherein theweights are provided as follows: w_(j)=1∀_(j), wherein ∀_(j) is a set ofat least one of all known genotypes or all known haplotypes at one ormore predetermined polymorphic genetic loci.
 30. The system of claim 28,wherein the weights are provided as follows: w_(j) is different for eachgenotype or haplotype.
 31. A non-transitory computer-accessible mediumhaving stored thereon computer executable instructions for at least oneof genotyping or haplotyping a sequence of polymorphic genetic loci in adeoxyribonucleic acid (DNA) sample or identifying a strain variant fromthe DNA sample, wherein, when the executable instructions are executedby a processing arrangement, configure the processing arrangement to: i)provide one or more microarrays that include a set of oligonucleotideprobes that are capable of detecting the at least one of the genotypes,the haplotypes or the strain variant; ii) hybridize the DNA sample tothe one or more microarrays to create a hybridization pattern; and iii)determine at least one of a genotype, a haplotype or a strain variantbased on the hybridization pattern, wherein procedure (iii) produces avector of n measurements, and wherein n is a number of probes containedon the one or more microarrays.
 32. The computer-accessible medium ofclaim 31, wherein the n potential probes provided to identify N knowngenotypes or haplotypes are each associated with a response vector{right arrow over (υ)}_(j) ε{0,1}^(N), j=1, . . . ,n.
 33. Thecomputer-accessible medium of claim 32, further comprising generating agraph G on vertices corresponding to probe response vectors, wherein thegraph G is a complete edge-weighted and vertex-weighted undirected graphG=(V, E) provided on n vertices, wherein n is the number of potentialprobes.
 34. The computer-accessible medium of claim 33, wherein the weights w of each vertex v and each edge e are constrained by: 0≦w(v), w(e)≦1.
 35. The computer-accessible medium of claim 34, wherein at least one of (i) the weight w of a vertex v is set to: w(v)=min{fraction of 0's, fraction of 1's}, or (ii) the weight w of an edge e={u,v} is set to: w(e)=Hamming distance/vector length, wherein the Hamming distance is measured between the probe response vectors corresponding to vertices u and v, and wherein the vector length is the length of the probe response vectors N.
 36. The computer-accessible medium of claim 33, further comprising modifying the graph G by thresholding the edges such that the modified graph G_(mod) is defined as G_(mod)=(V, E_(mod)), wherein E_(mod)={e ∈ E: w(e)≦ρ}, and ρ is a selected threshold value.
 37. The computer-accessible medium of claim 36, wherein, for the modified graph G_(mod) and the probe set size M, when the executable instructions are executed, the processing arrangement is further configured to: iv) initialize a current-best list of independent sets with associated information weights, v) initialize vertex boosting weights to vertex weights w(v), vi) define a probability distribution on the vertex subset based on vertex boosting weights, vii) choose a random subset of vertices of a specified size M based on the probability distribution, viii) eliminate one of the end-point vertices in each of the edges remaining in the induced subgraph on the random subset, ix) modify the vertex boosting weights by increasing the weights of the vertices that are retained in the subset and decreasing the weights of the vertices that were selected in procedure (vii) but eliminated in procedure (viii), and x) repeat procedures (vi) through (ix) for at least one of a predetermined number of iterations or until no improvement to the list of top independent sets is achieved.
 38. The computer-accessible medium of claim 37, wherein, for the modified graph G_(mod) and the probe set size M, when executed, the processing arrangement is further configured to repeat procedures (v) through (x) for a predetermined number of iterations, and to start each iteration with a reinitialization of the vertex boosting weights to the vertex weights w(v) in procedure (v).
 39. The computer-accessible medium of claim 38, wherein at least one of (i) for a given fixed small 0<ε<<1, the probe set size M satisfies an inequality Pr(∀ code pairs, Hamming distance≧1)>1−ε, or (ii) for a given fixed small 0<ε<<1 and a fixed α>1, the probe set size M satisfies an inequality Pr(∀ code pairs, Hamming distance≧α)>1−ε.
 40. The computer-accessible medium of claim 37, wherein the threshold ρ has a value to enable the graph G to have a sparsity bounded by A≦sparsity≦B, wherein the sparsity is definable by the average degree of a vertex in the graph G, wherein the lower bound A is a relatively small constant, and wherein the upper bound B is a function of the number of vertices n.
 41. A system for at least one of genotyping or haplotyping a sequence of polymorphic genetic loci in a deoxyribonucleic acid (DNA) sample or identifying a strain variant from the DNA sample, comprising: a processing device that, when executed, is configured to: i) provide one or more microarrays that include a set of oligonucleotide probes that are capable of detecting the at least one of the genotypes, the haplotypes or the strain variant; ii) hybridize the DNA sample to the one or more microarrays to create a hybridization pattern; and iii) determine at least one of a genotype, a haplotype or a strain variant based on the hybridization pattern, wherein procedure (iii) produces a vector of n measurements, and wherein n is a number of probes contained on the one or more microarrays.
 42. The system of claim 41, wherein the n potential probes provided to identify N known genotypes or haplotypes are each associated with a response vector {right arrow over (υ)}_(j) ∈ {0,1}^(N), j=1, . . . ,n.
 43. The system of claim 42, further comprising generating a graph G on vertices corresponding to probe response vectors, wherein the graph G is a complete edge-weighted and vertex-weighted undirected graph G=(V, E) provided on n vertices, wherein n is the number of potential probes.
 44. The system of claim 43, wherein the weights w of each vertex v and each edge e are constrained by: 0≦w(v), w(e)≦1.
 45. The system of claim 44, wherein at least one of (i) the weight w of a vertex v is set to: w(v)=min{fraction of 0's, fraction of 1's}, or (ii) the weight w of an edge e={u,v} is set to: w(e)=Hamming distance/vector length, wherein the Hamming distance is measured between the probe response vectors corresponding to vertices u and v, and wherein the vector length is the length of the probe response vectors N.
 46. The system of claim 43, further comprising modifying the graph G by thresholding the edges such that the modified graph G_(mod) is defined as G_(mod)=(V, E_(mod)), wherein E_(mod)={e ∈ E: w(e)≦ρ}, and ρ is a selected threshold value.
 47. The system of claim 46, wherein, for the modified graph G_(mod) and the probe set size M, when executed, the processing device is further configured to: iv) initialize a current-best list of independent sets with associated information weights, v) initialize vertex boosting weights to vertex weights w(v), vi) define a probability distribution on the vertex subset based on vertex boosting weights, vii) choose a random subset of vertices of a specified size M based on the probability distribution, viii) eliminate one of the end-point vertices in each of the edges remaining in the induced subgraph on the random subset, ix) modify the vertex boosting weights by increasing the weights of the vertices that are retained in the subset and decreasing the weights of the vertices that were selected in procedure (vii) but eliminated in procedure (viii), and x) repeat procedures (vi) through (ix) for at least one of a predetermined number of iterations or until no improvement to the list of top independent sets is achieved.
 48. The system of claim 47, wherein, for the modified graph G_(mod) and the probe set size M, when executed, the processing device is further configured to repeat procedures (v) through (x) for a predetermined number of iterations, and to start each iteration with a reinitialization of the vertex boosting weights to the vertex weights w(v) in procedure (v).
 49. The computer-accessible medium of claim 48, wherein at least one of (i) for a given fixed small 0<ε<<1, the probe set size M satisfies an inequality Pr(∀ code pairs, Hamming distance≧1)>1−ε, or (ii) for a given fixed small 0<ε<<1 and a fixed α>1, the probe set size M satisfies an inequality Pr(∀ code pairs, Hamming distance≧α)>1−ε.
 50. The computer-accessible medium of claim 47, wherein the threshold ρ has a value to enable the graph G to have a sparsity bounded by A≦sparsity≦B, wherein the sparsity is definable by the average degree of a vertex in the graph G, wherein the lower bound A is a relatively small constant, and wherein the upper bound B is a function of the number of vertices n.
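By way of non-limiting illustration only, the vertex weights, edge weights, edge thresholding, and randomized boosting procedures recited above (for example, in claims 45 to 47) might be sketched in Python as follows. The function names, the multiplicative `boost` factor, the rule of dropping the lower-weight endpoint of each remaining edge, the scoring of a set by the sum of its vertex weights, and the tracking of a single best set rather than a list are all assumptions introduced for this sketch and are not specified by the claims.

```python
import random
from itertools import combinations


def vertex_weight(vec):
    """w(v) = min{fraction of 0's, fraction of 1's} of a 0/1 response vector."""
    ones = sum(vec) / len(vec)
    return min(ones, 1.0 - ones)


def edge_weight(u, v):
    """w(e) = Hamming distance / vector length."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def threshold_edges(vectors, rho):
    """E_mod = {e in E : w(e) <= rho}, returned as pairs (i, j) with i < j."""
    return {
        (i, j)
        for i, j in combinations(range(len(vectors)), 2)
        if edge_weight(vectors[i], vectors[j]) <= rho
    }


def weighted_sample(weights, m, rng):
    """Draw up to m distinct indices, each draw proportional to its weight."""
    pool = list(range(len(weights)))
    chosen = []
    for _ in range(min(m, len(pool))):
        pick = rng.choices(pool, weights=[weights[i] for i in pool], k=1)[0]
        pool.remove(pick)
        chosen.append(pick)
    return chosen


def select_probes(vectors, rho, M, iterations=200, boost=1.1, seed=0):
    """Randomized search for a heavy independent set of G_mod (hypothetical sketch)."""
    rng = random.Random(seed)
    edges = sorted(threshold_edges(vectors, rho))
    base = [vertex_weight(v) for v in vectors]
    # Assumption: floor at a tiny positive value so every vertex can be drawn.
    boosts = [max(w, 1e-9) for w in base]
    best, best_score = [], -1.0
    for _ in range(iterations):
        subset = weighted_sample(boosts, M, rng)
        kept = set(subset)
        for i, j in edges:
            if i in kept and j in kept:
                # Assumption: drop the endpoint with the smaller vertex weight.
                kept.discard(i if base[i] <= base[j] else j)
        # Raise the boosting weight of retained vertices, lower eliminated ones.
        for v in subset:
            boosts[v] = boosts[v] * boost if v in kept else boosts[v] / boost
        # Assumption: score a set by the sum of its vertex weights.
        score = sum(base[v] for v in kept)
        if score > best_score:
            best, best_score = sorted(kept), score
    return best
```

In this sketch, an edge of G_(mod) joins two probes whose response vectors are nearly identical (normalized Hamming distance at most ρ), so an independent set corresponds to a set of mutually distinguishable probes. For instance, with the response vectors (1,0,1,0), (1,0,1,0), (0,1,1,0), (1,1,0,0), and (0,0,0,1) and ρ=0, only the first two probes are joined by an edge, and a returned probe set never contains both of them.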